The standard approach for term frequency normalization is based only on the document length. However, it
does not distinguish the verbosity from the scope, these being the two main factors determining the document
length. Because the verbosity and scope have largely different effects on the increase in term frequency,
the standard approach can easily suffer from insufficient or excessive penalization depending on the specific
type of long document. To overcome these problems, this paper proposes two-stage normalization by performing
verbosity and scope normalization separately, and by employing different penalization functions. In
verbosity normalization, each document is pre-normalized by dividing the term frequency by the verbosity
of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner
to the pre-normalized document, finally leading us to formulate our proposed verbosity normalized (VN)
retrieval model. Experiments carried out on standard TREC collections demonstrate that the VN
model leads to marginal but statistically significant improvements over standard retrieval models.
1. INTRODUCTION

In information retrieval (IR), term frequency is a fundamental and important component
of a ranking model. Intuitively, the larger the term frequency of a query word in
a document, the more likely the document is to be about the query topic, and thus,
the document should have a higher relevance score. In practice, however, documents
are of various lengths, and the simple approach of preferring documents with higher
term frequency could easily result in an excessive preference for long documents. To
use the term frequency more fairly, normalization of the term frequency has
been extensively investigated by researchers.

With regard to the normalization problem, Robertson and Walker introduced the
verbosity and the scope hypotheses, which state that document length is mainly determined
by two factors - verbosity and scope - as follows [Robertson and Walker 1994;
Robertson and Zaragoza 2009]:

1) Verbosity hypothesis: “Some authors are simply more verbose, using more words
to say the same thing [Robertson and Zaragoza 2009].”

2) Scope hypothesis: “Some authors have more to say: they may write a single
document containing or covering more ground [Robertson and Zaragoza 2009].”

In this paper, we focus on the difference between the effects of the verbosity and
the scope on the term frequency of a single word. Verbosity, as the name implies, is
related to the burstiness of term frequency, which helps a word already mentioned in
a document get a higher frequency. Even if a word has a low term frequency under normal
verbosity, its term frequency could increase significantly when the document has high
verbosity. On the other hand, scope mostly involves the introduction of new words, rather
than boosting the term frequency. Broadening the scope of a document would help
words unseen in a normal document get non-zero frequencies. However, these non-zero
frequencies might not be high. Therefore, verbosity leads to a significant increase
in term frequency, whereas scope leads to a rather limited increase in term frequency.
In other words, the scope of a document only helps the occurrence of a new word, and
the term frequency of the word is mostly governed by the verbosity of the document.

Despite this difference between verbosity and scope, standard normalization is a
length-driven approach, i.e., it is based only on the document length, without distinguishing
between verbosity and scope. As a result, it may suffer from insufficient penalization
of a verbose document whose length is increased mainly by high verbosity,
and excessive penalization of a broad document whose length is mainly derived from
the broad scope.

In light of this difference, this paper argues that verbosity and scope
should be normalized separately by employing different penalization functions. To
achieve this, we propose a two-stage normalization approach. We first perform verbosity
normalization for each document by linearly dividing the term frequency by
the verbosity, thus obtaining a verbosity-normalized document representation. We then
perform scope normalization, in which an existing retrieval model is applied to this
verbosity-normalized document representation. The final model obtained is called a
verbosity-normalized (VN) retrieval model.

Furthermore, we examine whether the proposed VN retrieval model resulting from
two-stage normalization performs the desired separate penalizations. Toward this end,
we first select three popular retrieval models - the Okapi model [Robertson et al. 1995],
the Dirichlet-prior (DP) smoothed language model [Zhai and Lafferty 2001], and the
Markov random field (MRF) model [Metzler and Croft 2005] - and then perform a
comparative axiomatic analysis of the original and the VN retrieval models, under the
setting of the axiomatic framework introduced in [Fang et al. 2004; Fang et al. 2011].
The analysis results confirm that the VN model indeed performs the desired separate
normalizations, i.e., a strict penalization of verbosity-increased documents and a
relaxed penalization of scope-broadened documents.

The results of experiments carried out on standard TREC test collections show that
the VN retrieval models are significantly better than the original models. The experimental
results support our motivating argument that the verbosity and scope should
be handled separately using different penalization functions.

The remainder of this paper is organized as follows. Section 2 describes previous
studies. Section 3 describes the proposed two-stage normalization approach and
presents the VN retrieval models for DP, Okapi, and MRF. Section 4 presents the
main results of the analysis of retrieval models under standard length normalization
constraints. Sections 5-7 present the experimental setting and results. Section 8
concludes.
2. PREVIOUS WORK

Singhal et al. [1996] recognized that simply dividing the term frequency by the document
length leads to the over-penalization problem in long documents. To overcome
this problem, they proposed pivoted normalization, in which a pivoted length is used to
normalize the term frequency by adding a constant pivot factor (i.e., average document
length) to the original document length. Pivoted normalization had originally been
introduced in Okapi's model [Robertson et al. 1995], before it was formalized and named
by Singhal et al. [1996]. Because pivoted normalization yields successful results, it
has been explicitly adopted by other retrieval models, such as the INQUERY system
[Callan et al. 199^; Allan et al. 2000]. A similar relaxed type of normalization has
been commonly used in more recent retrieval models - normalization 2 in the divergence
from randomness (DFR) retrieval framework [Amati and Van Rijsbergen 2002]
and the smoothed document length in DP [Zhai and Lafferty 2001].

Fang et al. [2004] formally and mathematically defined IR heuristics, drawn from
ranking characteristics most commonly used by existing retrieval models, thereby
proposing a novel direction for an axiomatic approach to IR. The retrieval heuristics
defined in the axiomatic approach have been used to define a new retrieval model
inductively [Fang and Zhai 2005; Clinchant and Gaussier 2010] and to restrict the
search space for automatically learning a retrieval function [Cummins and O'Riordan 2006].
In addition to the original constraints, some studies have explored new constraints,
including semantic term matching constraints [Fang and Zhai 2006], the proximity-based
matching constraint [Tao and Zhai 2007], the burstiness-based normalization constraint
[Clinchant and Gaussier 2010], the document frequency constraint for pseudo-relevance
feedback [Clinchant and Gaussier 2011], the feedback term weight constraints for
pseudo-relevance feedback [Clinchant and Gaussier 2013], and the translation probability
constraints for translation language models [Karimzadehgan and Zhai 2012]. With regard
to the length normalization problem, Fang et al. [2004] defined three length
normalization constraints (referred to as LNC1, LNC2, and TF-LNC), demonstrating
analytically that popular retrieval models satisfy all these normalization constraints
at least for content-bearing words.

Our argument that different normalization functions should be used
for verbosity and scope was also proposed by [Robertson and Walker 1994;
Robertson and Zaragoza 2009], in a more restricted manner, as follows: “The verbose
hypothesis suggests that we should simply normalize term frequencies by dividing
by document length, while the scope hypothesis, on the other hand, suggests the opposite
[Robertson and Zaragoza 2009].” That is, they suggest that a retrieval function
does not necessarily need to penalize a long document when it has a broad scope. A similar
argument was also made by [Na et al. 2008a]. Our suggestion, however, is that we still
need the penalization for scope, but in a much more relaxed manner. In this sense, our
argument can be regarded as a generalization of the previous arguments.

To the best of our knowledge, one of the first approaches for two-stage normalization
is pivoted unique normalization, suggested by [Singhal et al. 1996]. In their approach,
the term frequency is first normalized on the basis of a nonlinear function by using
the average term frequency (which corresponds to verbosity normalization), and the
normalized term frequency is then further divided by a pivoted unique length (which
corresponds to scope normalization). However, it remains unclear how their approach
can be generalized to other retrieval models.

Going beyond the aforementioned existing works, we propose a generalized two-stage
normalization approach, arguing more clearly that the term frequency should
be penalized differently, depending on whether a document is long because of the
verbosity or the scope. Our approach is not limited to a specific retrieval model or
a specific measure of the verbosity or scope. We also analytically present the retrieval
heuristics realized by two-stage normalization, by performing a comparative
axiomatic analysis under the setting of standard normalization constraints suggested
by Fang et al. [2004].

It is noteworthy that Okapi and the DFR retrieval framework
[Amati and Van Rijsbergen 2002] can be considered as another type of two-stage
normalization. According to the derivation by [He and Ounis 2003], the first
step normalizes the term frequency by a relaxed document length using
$tfn = c(w,d) / (k_1((1-b) + b \cdot |d|/avgl))$ in Okapi and
$tfn = c(w,d) \log(1 + c \cdot avgl/|d|)$ in DFR,
and the second step further normalizes $tfn$ by $tfn/(tfn+1)$. The first step uses
the document length, thereby performing a mixed normalization of verbosity and
scope, and the second step roughly performs verbosity normalization by preventing a
document with high $tfn$ from getting a very large score. However, this is not the case
in our approach, which further distinguishes between the verbosity and the scope.

Interestingly, passage retrieval [Salton and Buckley 1991; Salton et al. 1993;
Callan 1994; Mittendorf and Schauble 1994; Allan 1995; Kaszkiel and Zobel 1997;
Kaszkiel et al. 1999; Liu and Croft 2002; Bendersky and Kurland 2008; Na et al. 2008b;
Lv and Zhai 2009b; Bendersky and Kurland 2010; Lv and Zhai 2010; Krikon et al. 2010;
Krikon and Kurland 2011] can also be viewed as two-stage normalization. Because scopes
are more similar among passages than among documents, using passages can itself be
considered a type of scope normalization. Thereafter, applying an existing retrieval
method to score each passage corresponds to verbosity normalization.

Recently, Lv and Zhai [2011c] and Lv and Zhai [2011b] observed that when documents
are extremely long, the score gap, calculated as the difference between scores
when a query term is present and when it is absent in a document, could be infinitely
close to zero or negative. As a result, extremely long documents tend to be
overly penalized. To ensure a desirable score gap between documents that match and
do not match a query term, Lv and Zhai [2011b] proposed lower-bounding term frequency
normalization, which can be described as follows: (1) A pseudo score gap between
documents that match and do not match a query term is newly introduced
as a document-independent factor. (2) For each query term, the pseudo score gap is
added to the original document score only when the document matches the query
term, whereas the original document score is left unchanged for a document that
does not match the query term.^1 Importantly, Lv and Zhai [2011b] closely examined
the underlying principles of their proposed normalization, after which they proposed
the constraints LB1 and LB2 as extensions of the existing formal heuristics used in
[Fang et al. 2011]. According to their axiomatic analysis, all modified retrieval functions
proposed in [Lv and Zhai 2011b] unconditionally or more easily satisfy the lower
bounds (LBs) without violating the original constraints of [Fang et al. 2011], whereas
existing functions do not satisfy the LBs. Experimental results showed that all modified
retrieval functions achieved statistically significant improvements, especially for verbose
queries. In contrast to our work, the lower-bounding normalization proposed in
[Lv and Zhai 2011b] uses only the document length, whereas we distinguish
the verbosity of the document from the scope. In addition, the new constraints
used in [Lv and Zhai 2011b] are complementary to the existing length normalization
constraints (LNCs), whereas our work emphasizes the need to pursue a new generation
of LNCs.

^1 The same scoring function can be equivalently implemented by redefining a within-document scoring
function for both cases (i.e., either a document matches a query or it does not), as formulated in
[Lv and Zhai 2011b].

2.1. Novel Contributions beyond Our Prior Work 

In [Na et al. 2008a], the initial form of the two-stage normalization approach was presented
to modify language modeling approaches by introducing the pseudo document
model. However, [Na et al. 2008a] were not aware of the importance of the pseudo document
model as a generalized solution for handling the addressed problem. In addition,
[Na et al. 2008a] suggested a rather harsh retrieval constraint called TNC, which
is too strong to be satisfied by even their own proposed method. Given the previous presentation
of [Na et al. 2008a], it thus remained unclear how the presented normalization
yields some of the reported improved performances, and how it can be generalized
to other retrieval models. Building on our prior work, the novel contributions of this paper
are as follows:

— Generalized two-stage normalization (Section 3), which was not explicitly argued and
not fully formalized in [Na et al. 2008a]. With the explicit formulation, we now correctly
understand VN-DP as a specific instance of two-stage normalization.

— Extensions to other models - Okapi and MRF (Section 3), as a result of the proposed
generalized normalization.

— Analytically capturing retrieval heuristics of two-stage normalization by performing
comparative axiomatic analysis (Section 4 & Appendices C, D, and E) under the standard
constraint setting of [Fang et al. 2004; Fang et al. 2011].

— LengthPower as a novel scope measure (Section 3.4). Using LengthPower, we have a
unified view of language modeling approaches by considering both JM and DP as
special cases of VN-DP.

— Comparison with lower bounding term frequency normalization (SectionO 

3. TWO-STAGE NORMALIZATION 

In this section, we describe our proposed two-stage normalization in detail, and apply 
it to the DP, Okapi, and MRF approaches, as case studies. 

3.1. Verbosity Normalization 

The following are notations commonly used in this paper.

• $V$: Set of all words, $w_1, w_2, \cdots, w_{|V|}$

• $N$: Number of documents in a given collection

• $C$: A given collection, consisting of $d_1, \cdots, d_N$. Often, we also use $C$ to refer to the
concatenated representation of all documents in $C$.

• $df(w)$: Document frequency of $w$

• $d$ (or $q$): A given document (or a query)

• $c(w,d)$ (or $c(w,q)$): Term frequency of word $w$ in document $d$ (or query $q$)

• $c(w,C)$: Term frequency of word $w$ in collection $C$, defined by $\sum_{d \in C} c(w,d)$

• $idf(w)$: Term discrimination value of $w$, such as IDF

• $|d|$: Length of document $d$, defined by $\sum_{w \in V} c(w,d)$

• $|C|$: Length of collection $C$, defined by $\sum_{w \in V} c(w,C)$ (for brevity of notation, $C$ is either
the set of documents or the concatenated representation of documents, depending
on context)

• $s(d)$: Scope of document $d$ ($s(d) \le |d|$)

• $v(d)$: Verbosity of document $d$

• $avgl$, $avgv$, $avgs$: Average length, verbosity, and scope, respectively, of documents in
the collection.
Motivated by the verbosity and the scope hypotheses, we first assume that the document
length is decomposed into the verbosity and the scope, thereby providing the
following simplified formula:

$$|d| = v(d)\, s(d) \tag{1}$$

As a result, we can formulate $v(d)$ in terms of $s(d)$ and $|d|$ as follows:

$$v(d) = \frac{|d|}{s(d)} \tag{2}$$

The derivation of Eq. (2) is presented in Appendix A.

In verbosity normalization, the original term frequency is normalized by dividing it
by the verbosity of the document. To formally describe verbosity normalization, let $\phi$
be a verbosity normalization operator;^2 $\phi(d)$, the verbosity-normalized document representation
of $d$, which is the document transformed by applying the operator $\phi$ to all
words in a document $d$; and $c(w,\phi(d))$, the verbosity-normalized term frequency of word
$w$. Then, verbosity normalization refers to the process of obtaining $c(w,\phi(d))$ for word
$w$, using the following formula:

$$c(w,\phi(d)) = k \cdot \frac{c(w,d)}{v(d)} \tag{3}$$

where $k$ is a verbosity scaling parameter. By substituting Eq. (2) into Eq. (3), $c(w,\phi(d))$
becomes

$$c(w,\phi(d)) = k \cdot \frac{c(w,d) \cdot s(d)}{|d|}$$

The resulting normalized term frequency is not only inversely proportional to the document
length but is also proportional to the scope of the document.
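
The verbosity normalization step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy document, and the choice of the number of unique terms as the scope are assumptions made only for the example.

```python
from collections import Counter


def verbosity(doc_tf, scope):
    """Eq. (2): v(d) = |d| / s(d), with |d| the sum of term frequencies."""
    doc_len = sum(doc_tf.values())
    return doc_len / scope


def verbosity_normalize(doc_tf, scope, k=1.0):
    """Eq. (3): c(w, phi(d)) = k * c(w, d) / v(d) for every word in d."""
    v = verbosity(doc_tf, scope)
    return {w: k * tf / v for w, tf in doc_tf.items()}


# Toy document with 12 tokens; the scope is taken (as an assumption for this
# example) to be the number of unique terms, i.e., 3.
doc = Counter("a a a a a a b b b b c c".split())
scope = float(len(doc))
phi_d = verbosity_normalize(doc, scope)
```

With these toy values the verbosity is 12 / 3 = 4, so every term frequency is divided by 4.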

3.2. Scope Normalization 

For scope normalization, we need to consider a more relaxed function than that for
verbosity normalization. We first note that the scope of an original document is the
verbosity-normalized length of the document, as follows:

$$|\phi(d)| = \sum_{w \in V} c(w,\phi(d)) = k \cdot \frac{s(d) \sum_{w \in V} c(w,d)}{|d|} = k \cdot s(d)$$

Furthermore, existing retrieval models perform a type of relaxed normalization by
using their pivoted length or smoothed length. Thus, instead of developing a new function,
we perform scope normalization by straightforwardly applying an existing retrieval
model to the verbosity-normalized document representation $\phi(d)$. Formally, let
$f(d,q)$ be the original retrieval function that gives a score to $d$, for query $q$. Applying
two-stage normalization to $f(d,q)$ gives $f(\phi(d),q)$, which is obtained by replacing
$c(w,d)$ used in all terms in $f(d,q)$ with $c(w,\phi(d))$ for all documents in the collection. We
call $f(\phi(d),q)$ a VN (verbosity-normalized) retrieval model or a VN scoring function.


^2 In our notation, the verbosity normalization operator is applied not to the document itself but instead to the
document representation. In this paper, the document representation is assumed to be a vector of its term
frequencies (and either bigram or proximal term frequencies). For general purposes, the verbosity normalization
operator needs to be extended such that it can be applied to advanced document representations such as
a sequence of words, so that it can be useful for proximity-based or location-based search.
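
The idea of applying an unchanged scoring function to the verbosity-normalized representation can be sketched as a small wrapper. The names `phi`, `vn_model`, and `toy_score` are hypothetical; in particular `toy_score` is only a stand-in for an existing retrieval model and is not one of the models studied in this paper.

```python
import math
from collections import Counter


def phi(doc_tf, scope_fn, k=1.0):
    """Verbosity-normalized representation: c(w, phi(d)) = k * c(w, d) * s(d) / |d|."""
    doc_len = sum(doc_tf.values())
    s = scope_fn(doc_tf)
    return {w: k * tf * s / doc_len for w, tf in doc_tf.items()}


def vn_model(score_fn, scope_fn, k=1.0):
    """Wrap an existing scorer f(d, q) so that it computes f(phi(d), q)."""
    def vn_score(doc_tf, query_tf):
        return score_fn(phi(doc_tf, scope_fn, k), query_tf)
    return vn_score


def toy_score(doc_tf, query_tf):
    """Hypothetical stand-in for an existing model: smoothed relative frequency."""
    doc_len = sum(doc_tf.values())
    return sum(qtf * doc_tf.get(w, 0.0) / (doc_len + 1.0)
               for w, qtf in query_tf.items())


def unique_terms(doc_tf):
    """Illustrative scope measure: number of distinct terms."""
    return float(len(doc_tf))


doc = Counter({"a": 6, "b": 4, "c": 2})
normalized_length = sum(phi(doc, unique_terms, k=2.0).values())  # equals k * s(d)
vn_toy = vn_model(toy_score, unique_terms)
```

The wrapper leaves the original scorer untouched, which mirrors the point that scope normalization reuses whatever relaxed length normalization the existing model already performs.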


ACM Transactions on Information Systems, Vol. 0, No. 0, Article 00, Publication date: 2014. 







Two-Stage Document Length Normalization 




3.3. Examples of Verbosity-Normalized Retrieval Models 

In this section, we present the application of two-stage normalization to the DP, Okapi, 
and MRF approaches. 

3.3.1. Dirichlet-prior (DP). DP performs Bayesian smoothing on a multinomial language
model [Zhai and Lafferty 2001], for which the conjugate prior is the Dirichlet distribution
with the following parameters:

$$(\mu p(w_1|C),\ \mu p(w_2|C),\ \cdots,\ \mu p(w_{|V|}|C)) \tag{4}$$

The Bayesian priors using the parameters of Eq. (4) give the following smoothed model
of document $d$:

$$P(w|d) = \frac{c(w,d) + \mu p(w|C)}{|d| + \mu}$$

and the following scoring function for a given query $q$ [Zhai and Lafferty 2001]:

$$\sum_{w \in q \cap d} c(w,q) \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$


The VN model $f(\phi(d), q)$ is assumed to employ the following document-specific conjugate
prior:

$$(\mu v(d) p(w_1|C),\ \mu v(d) p(w_2|C),\ \cdots,\ \mu v(d) p(w_{|V|}|C)) \tag{5}$$

In other words, the more verbose $d$ is, the larger is the prior used. A detailed
justification for Eq. (5) is presented in Appendix B. These modified Bayesian priors
using the parameters of Eq. (5) give the following smoothed model:

$$P(w|d) = \frac{c(w,d) + \mu v(d) p(w|C)}{|d| + \mu v(d)} \tag{6}$$

We simply use $k = 1$ in Eq. (3), because the scaling parameter $k$ of $c(w,\phi(d))$ is absorbed
into the smoothing parameter $\mu$. Then, after dividing both the numerator and the
denominator by $v(d)$, Eq. (6) becomes

$$P(w|\phi(d)) = \frac{c(w,\phi(d)) + \mu \cdot p(w|C)}{|\phi(d)| + \mu}$$
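
The equivalence between the document-specific prior of Eq. (6) and ordinary DP smoothing applied to the verbosity-normalized representation can be checked numerically. The sketch below assumes $k = 1$; the function names and the numeric values are illustrative assumptions only.

```python
import math


def p_doc_specific_prior(tf, doc_len, p_wc, mu, v):
    """Eq. (6): DP smoothing with the document-specific prior mu * v(d)."""
    return (tf + mu * v * p_wc) / (doc_len + mu * v)


def p_vn_representation(tf, doc_len, p_wc, mu, s):
    """Standard DP smoothing applied to phi(d), with k = 1."""
    v = doc_len / s          # Eq. (2)
    tf_phi = tf / v          # Eq. (3) with k = 1
    len_phi = s              # |phi(d)| = s(d) when k = 1
    return (tf_phi + mu * p_wc) / (len_phi + mu)


# Illustrative values: c(w,d) = 5, |d| = 400, s(d) = 80, p(w|C) = 0.001, mu = 1000.
tf, doc_len, s, p_wc, mu = 5.0, 400.0, 80.0, 0.001, 1000.0
lhs = p_doc_specific_prior(tf, doc_len, p_wc, mu, v=doc_len / s)
rhs = p_vn_representation(tf, doc_len, p_wc, mu, s)
```

Both expressions evaluate to the same probability because the second is the first with numerator and denominator divided by $v(d)$.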

This smoothed model is the same as the one obtained by replacing $c(w,d)$ with $c(w,\phi(d))$
in the original DP model. Using it, the resulting retrieval function is given as

$$\sum_{w \in q \cap d} c(w,q) \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)} \cdot \frac{s(d)}{|d|}\right) + |q| \cdot \ln\frac{\mu}{s(d) + \mu} \tag{7}$$

which is called VN-DP.^3

^3 The formula of VN-DP is equivalent to the modified Dirichlet-prior smoothing suggested by
[Na et al. 2008a].

3.3.2. Okapi. Okapi's BM25 retrieval formula, as presented by
[Robertson et al. 1995], is

$$\sum_{w \in q \cap d} \frac{(k_3+1)\, c(w,q)}{k_3 + c(w,q)} \cdot \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot tf_{BM25}(w,d)$$

where the term frequency component $tf_{BM25}(w,d)$ is

$$tf_{BM25}(w,d) = \frac{(k_1+1)\, c(w,d)}{k_1\left((1-b) + b\,\frac{|d|}{avgl}\right) + c(w,d)}$$
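Table I. Feature functions used in the MRF model. $c_{\#1}(q_i q_{i+1}, d)$
indicates the number of times that the exact phrase $q_i q_{i+1}$ occurs
in document $d$, and $c_{\#uw8}(q_i q_{i+1}, d)$ indicates the number
of times that both terms $q_i$ and $q_{i+1}$ appear ordered or unordered
within a window with a span of 8.

Feature                  Value
$f_T(d, q_i)$            $\ln \frac{c(q_i,d) + \mu_T \frac{c(q_i,C)}{|C|}}{|d| + \mu_T}$
$f_O(d, q_i q_{i+1})$    $\ln \frac{c_{\#1}(q_i q_{i+1},d) + \mu_O \frac{c_{\#1}(q_i q_{i+1},C)}{|C|}}{|d| + \mu_O}$
$f_U(d, q_i q_{i+1})$    $\ln \frac{c_{\#uw8}(q_i q_{i+1},d) + \mu_U \frac{c_{\#uw8}(q_i q_{i+1},C)}{|C|}}{|d| + \mu_U}$
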

Here, $k_1$, $k_3$, and $b$ are constants. In the VN model, the IDF part is not changed;
however, $tf_{BM25}(w,d)$ is modified to $tf_{BM25}(w,\phi(d))$, obtained by replacing $c(w,d)$ with
$c(w,\phi(d))$, so that the length ratio $|d|/avgl$ becomes $s(d)/avgs$ (because
$|\phi(d)| = k \cdot s(d)$), as follows:

$$tf_{BM25}(w,\phi(d)) = \frac{(k_1+1)\, c(w,\phi(d))}{k_1\left((1-b) + b\,\frac{s(d)}{avgs}\right) + c(w,\phi(d))}$$

As in the case of DP, we assume the scale parameter $k$ to be 1, because it is absorbed
into $k_1$, resulting in the following final form:

$$tf_{BM25}(w,\phi(d)) = \frac{(k_1+1)\, c(w,d)}{k_1 |d| \left((1-b)\frac{1}{s(d)} + b\,\frac{1}{avgs}\right) + c(w,d)}$$

The modified Okapi function that uses $tf_{BM25}(w,\phi(d))$ in place of $tf_{BM25}(w,d)$ is called
VN-Okapi.
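
As a sanity check on the VN-Okapi term-frequency component, the sketch below computes it in two ways: by feeding the verbosity-normalized quantities into the ordinary BM25 component, and by the closed form written directly in terms of $c(w,d)$, $|d|$, and $s(d)$. The function names are hypothetical, $k = 1$ is assumed, and the defaults $k_1 = 1.2$ and $b = 0.75$ are merely conventional illustrative values, not settings taken from this paper.

```python
import math


def tf_bm25(tf, length, avg_length, k1=1.2, b=0.75):
    """Okapi term-frequency component with pivoted length normalization."""
    return (k1 + 1.0) * tf / (k1 * ((1.0 - b) + b * length / avg_length) + tf)


def tf_bm25_vn_substituted(tf, doc_len, s, avgs, k1=1.2, b=0.75):
    """Apply tf_bm25 to phi(d): c(w, phi(d)) = c(w, d) / v(d), |phi(d)| = s(d)."""
    v = doc_len / s
    return tf_bm25(tf / v, s, avgs, k1, b)


def tf_bm25_vn_closed_form(tf, doc_len, s, avgs, k1=1.2, b=0.75):
    """Final VN-Okapi form in terms of c(w, d), |d|, s(d), and avgs."""
    pivot = k1 * doc_len * ((1.0 - b) / s + b / avgs)
    return (k1 + 1.0) * tf / (pivot + tf)


# Illustrative document: c(w,d) = 3, |d| = 200, s(d) = 50, avgs = 60.
base = tf_bm25_vn_closed_form(3.0, 200.0, 50.0, 60.0)
# A purely more verbose variant: every count doubled, scope unchanged.
verbose = tf_bm25_vn_closed_form(6.0, 400.0, 50.0, 60.0)
```

In this sketch `base` and `verbose` coincide: doubling every term frequency while keeping the scope fixed doubles the verbosity, so the verbosity-normalized frequency and hence the VN component are unchanged.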


3.3.3. Markov Random Field (MRF). MRFs are undirected graphical models that are
used to define joint distributions over a set of random variables. The use of MRFs
for IR was suggested by [Metzler and Croft 2005; Metzler and Bruce Croft 2007], going
beyond the simplistic bag-of-words assumption by explicitly modeling the term
dependency among query words. Thus far, three different variants of the MRF model
have been suggested according to the type of dependency assumed among query
words - full independence, sequential dependence, and full dependence. This paper
focuses on sequential dependence, which has been widely used in many recent
works [Metzler and Croft 2007; Lease 2009; Bendersky et al. 2010; Wang et al. 2010;
Lang et al. 2010; Bendersky et al. 2011], because of its good balance between
effectiveness and efficiency.

To formally present the ranking function of the sequential dependence model, suppose that
$q$ is a sequence of $m$ terms $q_1 \cdots q_m$. According to the original framework, the relevance
score of a document $d$ is given by [Metzler and Croft 2005]

$$f(d,q) = \lambda_T \sum_{q_i \in q} f_T(d,q_i) + \lambda_O \sum_{q_i q_{i+1} \in q} f_O(d,q_i q_{i+1}) + \lambda_U \sum_{q_i q_{i+1} \in q} f_U(d,q_i q_{i+1}) \tag{8}$$

where we have the constraint $\lambda_T + \lambda_O + \lambda_U = 1$, and $f_T(d,q_i)$, $f_O(d,q_i q_{i+1})$, and
$f_U(d,q_i q_{i+1})$ are called the feature functions of the term, ordered phrase, and unordered
phrase, respectively. Table I presents the definition of each feature function
[Metzler and Croft 2005].
Table II. Verbosity-normalized feature functions used in the VN-MRF
model.

Feature                        Value
$f_T(\phi(d), q_i)$            $\ln \frac{\frac{c(q_i,d)}{v(d)} + \mu_T \frac{c(q_i,C)}{|C|}}{s(d) + \mu_T}$
$f_O(\phi(d), q_i q_{i+1})$    $\ln \frac{\frac{c_{\#1}(q_i q_{i+1},d)}{v(d)} + \mu_O \frac{c_{\#1}(q_i q_{i+1},C)}{|C|}}{s(d) + \mu_O}$
$f_U(\phi(d), q_i q_{i+1})$    $\ln \frac{\frac{c_{\#uw8}(q_i q_{i+1},d)}{v(d)} + \mu_U \frac{c_{\#uw8}(q_i q_{i+1},C)}{|C|}}{s(d) + \mu_U}$



Following the original framework [Metzler and Croft 2005], we assume that $\mu_T$, $\mu_O$,
and $\mu_U$ are the same, i.e., $\mu_T = \mu_O = \mu_U = \mu$, unless otherwise stated. We refer to the
retrieval function in Eq. (8) as MRF.

To derive a VN retrieval model $f(\phi(d), q)$ for MRF, we replace the original term frequencies
with the verbosity-normalized ones. For this purpose, let $c_{\#1}(q_i q_{i+1},\phi(d))$ and
$c_{\#uw8}(q_i q_{i+1},\phi(d))$ be VN ordered and unordered phrase term frequencies for $q_i q_{i+1}$,
respectively. Similar to the definition of VN term frequency in Eq. (3), these VN phrase
term frequencies are defined as follows:

$$c_{\#1}(q_i q_{i+1},\phi(d)) = k \cdot \frac{c_{\#1}(q_i q_{i+1},d)}{v(d)} \tag{9}$$

$$c_{\#uw8}(q_i q_{i+1},\phi(d)) = k \cdot \frac{c_{\#uw8}(q_i q_{i+1},d)}{v(d)} \tag{10}$$

Furthermore, let $f_T(\phi(d),q_i)$, $f_O(\phi(d),q_i q_{i+1})$, and $f_U(\phi(d),q_i q_{i+1})$ be VN feature functions
that correspond to the original feature functions. Table II describes the definition of
each VN feature function. As in the case of VN-DP, $k$ is assumed to be 1 in all VN feature
functions, because it is absorbed into $\mu_T$, $\mu_O$, or $\mu_U$. Finally, we obtain the scoring
function for the VN model $f(\phi(d),q)$ of MRF as follows:

$$f(\phi(d),q) = \lambda_T \sum_{q_i \in q} f_T(\phi(d),q_i) + \lambda_O \sum_{q_i q_{i+1} \in q} f_O(\phi(d),q_i q_{i+1}) + \lambda_U \sum_{q_i q_{i+1} \in q} f_U(\phi(d),q_i q_{i+1}) \tag{11}$$

The MRF model using Eq. (11) is referred to as VN-MRF.
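
All three VN feature functions share one shape: a count divided by the verbosity, smoothed with the collection statistic, over the scope plus the smoothing parameter. A minimal sketch of that shared shape and of the weighted combination follows. The function names are hypothetical, and the default weights 0.85, 0.10, and 0.05 are only illustrative values that satisfy the sum-to-one constraint; they are not taken from this paper.

```python
import math


def vn_feature(count_d, count_c, coll_len, v, s, mu):
    """ln[(count_d / v(d) + mu * count_c / |C|) / (s(d) + mu)] with k = 1."""
    return math.log((count_d / v + mu * count_c / coll_len) / (s + mu))


def vn_mrf_score(term_feats, ordered_feats, unordered_feats,
                 lam_t=0.85, lam_o=0.10, lam_u=0.05):
    """Weighted sum of the three groups of VN feature values."""
    return (lam_t * sum(term_feats)
            + lam_o * sum(ordered_feats)
            + lam_u * sum(unordered_feats))


# With v(d) = 1 and s(d) = |d|, the VN feature reduces to the original feature.
original = math.log((3 + 1000 * 50 / 1e6) / (200 + 1000))
reduced = vn_feature(3, 50, 1e6, v=1.0, s=200.0, mu=1000.0)
```

The same `vn_feature` call serves the term, ordered-phrase, and unordered-phrase features; only the counts passed in differ.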

3.4. Scope Measure 

The remaining problem is how to compute the scope of a document $s(d)$. In this study,
we adopt three different approaches - length power, the number of unique terms, and
entropy power.

3.4.1. LengthPower. As mentioned in the introduction, according to the scope hypothesis,
the document length is affected by the scope: the broader the scope of a document,
the longer the document is, when its verbosity is assumed to be fixed. Therefore, the
document length could possibly be used as a scope measure according to the scope hypothesis.
To derive such a length-based measure, suppose that the scope of a document
is a function of document length, i.e., $s(d) = g(|d|)$. Many variants exist for such a
function; however, the verbosity and the scope hypotheses help us restrict the possible
space for $g(|d|)$, given the following two necessary constraints:

• SC1: Scope $g(|d|)$ is a non-decreasing function of $|d|$.

• SC2: Verbosity $|d|/g(|d|)$ is a non-decreasing function of $|d|$.
To obtain a scope measure that satisfies both SC1 and SC2, we use Heaps'
law, which is given as follows [Heaps 1978]:^4

$$lp(d) = |d|^{\beta}$$

where $\beta$ is an additional constant.^5

The possible range of $\beta$ is $0 \le \beta \le 1$, from SC1 and SC2. Otherwise, $s(d)$ (or $v(d)$)
violates SC1 (or SC2) if $\beta < 0$ (or $\beta > 1$). This length-based scope measure $lp(d)$ exactly
degenerates into the original unnormalized representation, as a special case, when $\beta = 1$
and $k = 1$, in which case $lp(d) = |d|$, $v(d) = 1$, and $s(d) = |d|$. The scope measure using
$lp(d)$ is called LengthPower in this paper.
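
A short sketch of LengthPower and the verbosity it implies is given below. The function names and the sample lengths are assumptions for illustration; the check simply confirms that, for an exponent inside the admissible range, both the scope and the verbosity are non-decreasing in the document length.

```python
def length_power(doc_len, beta):
    """Scope s(d) = |d| ** beta (LengthPower), for 0 <= beta <= 1."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta outside [0, 1] violates SC1 or SC2")
    return doc_len ** beta


def verbosity_from_length(doc_len, beta):
    """Verbosity v(d) = |d| / s(d) = |d| ** (1 - beta)."""
    return doc_len / length_power(doc_len, beta)


lengths = [10, 100, 1000, 10000]
scopes = [length_power(n, 0.5) for n in lengths]
verbosities = [verbosity_from_length(n, 0.5) for n in lengths]
```

Setting the exponent to 1 recovers the unnormalized case (scope equal to the length, verbosity equal to 1), while an exponent of 0 attributes all of the length to verbosity.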


3.4.2. UniqLength. Another useful scope measure is the number of unique terms $u(d)$,
defined as $|\{w \mid w \in d\}|$. This is reasonable, because a different topic is described using
a domain-specific vocabulary or named entities. The more unique terms used in a document,
the larger is the scope of the document. The scope measure $u(d)$ is referred to
as UniqLength in this paper.


3.4.3. EntropyPower. The third scope measure is an entropy-based metric. Previously,
the entropy of a document was used to define the homogeneity measure of a document
[Bendersky and Kurland 2008], which corresponds to the opposite concept of scope.
Another entropy-based metric is the entropy power defined by the exponential of the
entropy, which was initially exploited in [Kurland and Lee 2005] to construct the document
structure. We compared the entropy with the entropy power in our preliminary
experiments and found that the latter outperformed the former because of its similarity
to document length or the number of unique terms. Thus, we choose entropy power
as our entropy-based metric, and it is defined as follows:

$$h(d) = \begin{cases} \exp\left(-\sum_{w} p_{ml}(w|d) \ln p_{ml}(w|d)\right) & \text{if } |d| > 1 \\ 0 & \text{otherwise} \end{cases}$$

where $p_{ml}(w|d)$ is defined by $c(w,d)/|d|$, which is the maximum likelihood estimation
(MLE) of the document language model for $d$. The scope measure $h(d)$ is called
EntropyPower in this paper.
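
UniqLength and EntropyPower can both be computed from a bag of term counts. The sketch below follows the definitions as printed above, including the length condition of the piecewise form; the function names and toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def uniq_length(doc_tf):
    """UniqLength: number of distinct terms in the document."""
    return len(doc_tf)


def entropy_power(doc_tf):
    """EntropyPower: exp of the entropy of the MLE document language model."""
    doc_len = sum(doc_tf.values())
    if doc_len <= 1:
        return 0.0
    entropy = -sum((tf / doc_len) * math.log(tf / doc_len)
                   for tf in doc_tf.values())
    return math.exp(entropy)


uniform = Counter({"a": 1, "b": 1, "c": 1, "d": 1})   # evenly spread terms
skewed = Counter({"a": 9, "b": 1})                     # one dominant term
```

For a document whose terms are evenly spread, the entropy power equals the number of unique terms; a skewed distribution yields a smaller value, which is one way to see why the two measures behave similarly but not identically.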


4. RETRIEVAL HEURISTICS OF VN RETRIEVAL MODELS

In order to analytically check how differently the VN method satisfies retrieval constraints
as compared to the corresponding original model, we present a comparative
axiomatic analysis performed under the retrieval constraints introduced by
Fang et al. [2004].^6

^4 Heaps' law predicts the number of unique terms in a document from the document length, i.e., the
number of unique terms in a corpus increases according to a $k \cdot |d|^{\beta}$ relationship to the document length.
Because the number of unique terms can be used as a scope measure to indicate how broad the topic of the
document is, as presented in Section 3.4.2, we use the formula of Heaps' law to approximately predict
the number of unique terms using only the document length.

^5 The original form of Heaps' law is $K|d|^{\beta}$, containing the additional parameter $K$. Here, we assume that $K$ is
absorbed in $k$.

^6 Note that our goal in this section is to ‘capture’ retrieval heuristics of VN retrieval models, but ‘not’ to
refine or improve the standard retrieval constraints of [Fang et al. 2004].

4.1. Reference Retrieval Constraints

As in the approach of [Clinchant and Gaussier 2010], we divide the six standard constraints
into two different sets - form constraints (i.e., TFC1, TFC2, and TDC in
[Fang et al. 2004]) and normalization constraints (i.e., LNC1, LNC2, and TF-LNC
in [Fang et al. 2004]). The form constraints specify the desirable restrictions on the
“curve” of a scoring function. Formally, suppose that $q$ consists of a single word $w$ and
$f(\phi(d),q)$ is formulated by $g(x,y)$, where $x$ is $c(w,d)$ and $y$ is $idf(w)$. Then, TFC1, TFC2,
and TDC [Clinchant and Gaussier 2010; Fang et al. 2011] correspond to, respectively:

$$\frac{\partial g(x,y)}{\partial x} > 0, \qquad \frac{\partial^2 g(x,y)}{\partial x^2} < 0, \qquad \frac{\partial g(x,y)}{\partial y} > 0$$

It can be easily shown that the TFCs and TDC are satisfied for all three normalized functions.
This is a natural result, because our normalization only linearly transforms the
term frequency and retains the original model, without any change to the basic concepts
of the original model.

The normalization constraints describe the necessary properties of a retrieval model for the case in which document-specific quantities such as length, verbosity, and scope differ across documents. According to [Fang et al. 2011], each normalization constraint can be equivalently described by how the score of a document changes after a perturbation operator is applied to the document. We introduce three perturbation operators called PAN, PLS, and PAR that correspond to LNC1, LNC2, and TF-LNC, respectively, as follows:^7

1) PAN (Perturbation of Adding Noise Words): PAN is an operator for adding noise terms, denoted by $\psi_{AN}$. Given $d$, $\psi_{AN}(d)$ is obtained by adding $K$ noise words $v_1 \cdots v_K$ to $d$, i.e., $\psi_{AN}(d) = d\,v_1 \cdots v_K$, where $v_i \notin q$. When $d_2 = \psi_{AN}(d_1)$, $|d_2| = |d_1| + K$ and $c(w, d_2) = c(w, d_1)$ for all $w \in q$.

2) PLS (Perturbation of Length Scaling): PLS is a length scaling operator, denoted by $\psi_{LS}$. Given $d$, $\psi_{LS}(d)$ is obtained by concatenating all query words in $d$ $K$ times and by scaling the length of $d$ up to $K$ times. When $d_1 = \psi_{LS}(d_2)$, $|d_1| = K \cdot |d_2|$ and $c(w, d_1) = K \cdot c(w, d_2)$ for all $w \in q$. Note that the concatenation is only applied to query words, not necessarily to non-query words. The non-query words in $d$ might (or might not) be kept in $\psi_{LS}(d)$. In the extreme case, none of the non-query words of $d$ appear in $\psi_{LS}(d)$, all of them being replaced with other non-query words.

3) PAR (Perturbation of Adding Relevant Words): PAR is an operator for adding a single relevant word, denoted by $\psi_{AR}$. Given $d$, $\psi_{AR}(d)$ is obtained by appending a single query word $w \in q$ $K$ times, i.e., $\psi_{AR}(d) = d\,w \cdots w$ (i.e., the number of attached copies of $w$ is $K$). When $d_1 = \psi_{AR}(d_2)$, $|d_1| = |d_2| + K$, $c(w, d_1) = c(w, d_2) + K$ for a given single word $w \in q$, and $c(w', d_1) = c(w', d_2)$ for all $w' \neq w$.


Here, $K$ is a perturbation parameter. LNC1, LNC2, and TF-LNC can now be equivalently described as follows:

LNC1: If $d_2 = \psi_{AN}(d_1)$, $f(d_1, q) \ge f(d_2, q)$ for $K \ge 1$.

LNC2: If $d_1 = \psi_{LS}(d_2)$, $f(d_1, q) \ge f(d_2, q)$ for $K \ge 1$.

TF-LNC: Let $q = \{w\}$ be a query with only one term $w$. If $d_1 = \psi_{AR}(d_2)$, $f(d_1, q) > f(d_2, q)$ for $K \ge 1$.
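As an illustration of how these perturbation-based constraints can be checked, the following Python sketch applies PAN and PAR to a toy document and tests LNC1 and TF-LNC for a single-term Dirichlet-prior score. The scoring form, the toy document, the noise words, and the values of `mu` and `p_wc` are assumptions made for this example only; they are not taken from the paper.

```python
import math
from collections import Counter


def pan(doc, noise_words):
    """PAN: append K noise (non-query) words to the document."""
    return list(doc) + list(noise_words)


def par(doc, w, k):
    """PAR: append k copies of a single query word w."""
    return list(doc) + [w] * k


def dirichlet_score(doc, w, p_wc, mu=1000.0):
    """Assumed single-term score: log((c(w,d) + mu*p(w|C)) / (|d| + mu))."""
    c = Counter(doc)[w]
    return math.log((c + mu * p_wc) / (len(doc) + mu))


d = ["ir", "model", "ir", "ranking", "length"]   # toy document
q_word, p_wc = "ir", 0.01                        # hypothetical query word and p(w|C)

d_noise = pan(d, ["foo", "bar", "baz"])          # d2 = psi_AN(d1), K = 3
d_rel = par(d, q_word, 3)                        # d1 = psi_AR(d2), K = 3

# LNC1: the document without the added noise should not score lower.
lnc1_holds = dirichlet_score(d, q_word, p_wc) >= dirichlet_score(d_noise, q_word, p_wc)
# TF-LNC: the document with the added query words should score higher.
tf_lnc_holds = dirichlet_score(d_rel, q_word, p_wc) > dirichlet_score(d, q_word, p_wc)
print(lnc1_holds, tf_lnc_holds)
```

The same pattern (perturb, rescore, compare) can be reused with any other scoring function by replacing `dirichlet_score`.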


The perturbation operator PLS for LNC2 is slightly different from the original version of LNC2 [Fang et al. 2004; Fang et al. 2011]. In the original version, $d_1$ consists of full copies of $d_2$, so that every copy is identical to $d_2$. In our PLS, only query words are concatenated $K$ times, and no further assumption is made about non-query words. Therefore, PLS is a generalized version of the original operator, including the original version as a special case. This generalization does not cause any inconsistency in the known analysis results of LNC2; the analysis results reported in [Fang et al. 2004;


^7 PAN, PLS, and PAR correspond to TN, LV3, and TG1, respectively, in [Fang et al. 2011].


ACM Transactions on Information Systems, Vol. 0, No. 0, Article 00, Publication date: 2014.
Seung-Hoon Na 


Table III. Percentages of A1 being satisfied using all non-stopwords in all queries from three collections (ROBUST, WT10G, and GOV2) and three query types (sk, sv, and lv). The columns A1^Okapi and A1^DP indicate the conditions df(w) < N/2 and c(w, d) > |d| p(w|C), respectively.

        ROBUST                 WT10G                  GOV2
        A1^Okapi   A1^DP       A1^Okapi   A1^DP       A1^Okapi   A1^DP
sk      99.5%      99.1%       98.5%      96.2%       99.3%      95.6%
sv      99.4%      98.3%       99.1%      95.1%       98.9%      93.5%
lv      99.3%      98.3%       99.3%      95.1%       99.2%      92.9%


Fang et al. 2011] for LNC2 remain valid with our PLS operator.

To see the difference more clearly, Algorithm 1 gives the detailed description of our PLS operator.


Algorithm 1 The detailed procedure of PLS

Step 1) Given d, we apply the original PLS operator of [Fang et al. 2004] to d to
        obtain d'; d' is obtained by simply concatenating all words in d K times.
Step 2) Given d', ψ_LS(d) is obtained after applying the following procedure:

    Let d' be w_1 ... w_|d'|.
    Initialize ψ_LS(d) as an empty document.
    for i ← 1, |d'| do
        if w_i ∈ q then
            ψ_LS(d) ← ψ_LS(d) w_i
        else
            ψ_LS(d) ← ψ_LS(d) w' (w' ∉ q), where w' is randomly chosen from V \ q.
        end if
    end for
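The procedure in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: documents are modelled as lists of tokens, and the function name `pls`, the `seed` argument, and the toy vocabulary are assumptions introduced here.

```python
import random
from collections import Counter


def pls(doc, query, vocab, k, seed=0):
    """Generalized PLS: scale |d| by k, keep query words, resample the rest."""
    rng = random.Random(seed)                     # fixed seed so the sketch is repeatable
    query = set(query)
    non_query_vocab = sorted(set(vocab) - query)  # V \ q
    copied = list(doc) * k                        # Step 1: concatenate d k times
    perturbed = []
    for w in copied:                              # Step 2: rewrite non-query positions
        if w in query:
            perturbed.append(w)                   # query words are kept as they are
        else:
            perturbed.append(rng.choice(non_query_vocab))
    return perturbed


doc = ["ir", "model", "ir", "length", "noise"]    # toy document d2
query = ["ir"]
vocab = ["ir", "model", "length", "noise", "scope", "term", "web"]

d1 = pls(doc, query, vocab, k=3)
print(len(d1), Counter(d1)["ir"])
```

Whatever the random choices are, the length and every query-word count are scaled by exactly `k`, which is all that LNC2 requires of the operator; only the non-query content (and hence the scope) may differ from the original full-copy operator.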


4.2. Analysis Results of Normalization Constraints 

4.2.1. Assumption. Before presenting our analysis results for the three normalization constraints, we make the following assumption:

- A1: Any query word $w \in q$ is assumed to be a content-bearing word (i.e., $df(w) < N/2$, and $c(w, d) > |d|\,p(w|C)$ for any document $d$ in the collection).

Empirically, A1 holds well in usual cases when stopwords are filtered out. Table III lists the percentage of cases in which A1 is satisfied, using all non-stopwords in all queries from three different collections and three query types. (Refer to Section 5.1 for a description of the collections and query types.) As shown in Table III, $c(w, d) > |d|\,p(w|C)$ is satisfied in more than 98% of the documents for all query words in ROBUST, in more than 95% in WT10G, and in about 93% or more in GOV2. The condition $df(w) < N/2$ is satisfied for more than 98% of the query terms.

4.2.2. Analysis Results. There exist conditions, common to all VN retrieval models, under which each normalization constraint is satisfied under A1. Table IV summarizes the analysis results for the general case and for the special cases of scope using LengthPower and UniqLength for VN retrieval models, relative to the original models.^8


^8 Note that the analysis results are obtained for DP and Okapi, not for MRF. For MRF, we do not carry out a separate axiomatic analysis, since it is not a base model like DP and Okapi but an extension of a base model (i.e., the scoring function of MRF is defined in terms of the main function of its base model). Thus, it is reasonable to assume, without separate analysis, that the normalization heuristics of MRF will not differ significantly from those of its base model.

Table IV. Analysis results of the original and VN retrieval models for three normalization constraints (LNC1, LNC2, and TF-LNC) under A1.

                                       LNC1    LNC2    TF-LNC
Original [Fang et al. 2004]            Yes     Yes     Yes
Verbosity-normalized (General)         C1      C2      C3
Verbosity-normalized (LengthPower)     Yes     Yes     Yes
Verbosity-normalized (UniqLength)      C1      C2      Yes
Verbosity-normalized (EntropyPower)    C1      C2      C4


Table V. Percentages of C4 being satisfied using all non-stopwords in all queries from three collections (ROBUST, WT10G, and GOV2) and three query types (sk, sv, and lv).

        ROBUST     WT10G      GOV2
sk      99.99%     99.97%     99.98%
sv      99.99%     99.98%     99.98%
lv      99.99%     99.98%     99.96%


Table IV uses the notation introduced by [Fang et al. 2004], where "Yes" and "$C_x$" indicate that the corresponding model satisfies the particular constraint unconditionally and under a particular condition, respectively. The specific conditions are

• $C_1$: $v(d_2) > v(d_1)$

• $C_2$: $s(d_1) > s(d_2)$

• $C_3$: $K / c(w, d_2) > v(d_1)/v(d_2) - 1$

• $C_4$: $s(d_2) < \left(|d_2| / c(w, d_2)\right)^{2}$

where $C_1$, $C_3$, and $C_4$ are sufficient but not necessary conditions to satisfy the particular constraint. Some derivations of the conditions are given in Appendix C-E.

As shown in Table IV, an original method satisfies all three constraints unconditionally under A1 according to [Fang et al. 2004], whereas a VN method requires additional conditions that depend on the choice of scope measure. An exceptional case is LengthPower, in which all constraints are satisfied unconditionally.

Among the three constraints, TF-LNC is satisfied under LengthPower and UniqLength, the detailed proofs of which are presented in Appendix D. Under EntropyPower, TF-LNC is satisfied for almost all query words in our test collections, as shown in Table V: $C_4$ is satisfied in more than 99.9% of the documents for all query words in ROBUST, WT10G, and GOV2. Therefore, we do not explore TF-LNC further in this paper.


4.3. Normalization Heuristics of VN Retrieval Models (Case: UniqLength and EntropyPower)

In this section, we discuss the retrieval behaviors entailed by the VN method in the cases of UniqLength and EntropyPower, relative to the original method. In our discussion, PAN and PLS are further divided into two types, V-type and S-type, which refer to verbosity-increasing and scope-broadening perturbations, respectively. These types of operators are defined as follows:

(1) V-type perturbation: The operator $\psi(\cdot)$ is called V-type if the perturbation does not increase the scope of the document, i.e., if $d_1 = \psi(d_2)$ and $\psi$ is V-type, $s(d_1) \le s(d_2)$.

(2) S-type perturbation: The operator $\psi(\cdot)$ is called S-type if the perturbation does not decrease the scope of the document, i.e., if $d_1 = \psi(d_2)$ and $\psi$ is S-type, $s(d_1) \ge s(d_2)$.

We then reexamine how the original and VN models satisfy the LNCs under V-type and S-type PAN and PLS.

The notable result is that $C_1$ and $C_2$ correspond to a relaxed penalization of a scope-broadened document and a strict penalization of a verbosity-increased document, respectively.

First, we present the first heuristic H1 and discuss its derivation from $C_1$.

4.3.1. H1: Relaxed penalization of scope-broadened documents. The VN retrieval method performs a relaxed penalization of a scope-broadened document after performing PAN (from LNC1).

To derive H1, we divide PAN into V-PAN and S-PAN. V-PAN denotes verbosity-increasing PAN, where the $K$ added noise words are covered by the original scope of the document. S-PAN denotes scope-broadening PAN, where the $K$ added noise words describe new content that is not covered by the scope of the original document. In terms of V-type and S-type, V-PAN and S-PAN can be defined as follows:

(1) V-PAN: V-PAN is a specific type of PAN, being V-type.

(2) S-PAN: S-PAN is a specific type of PAN, being S-type.

Suppose that $d_1$ and $d_2 = \psi_{AN}(d_1)$ are the given documents for LNC1. Then, we can show that the VN model often does not penalize $d_2$ for S-PAN, whereas it does penalize $d_2$ for V-PAN. The original model, on the other hand, always penalizes $d_2$ for both S-PAN and V-PAN. Thus, the VN model imposes a relaxed penalization on scope-broadened documents after PAN, relative to the original model.

Equivalently, the heuristic H1 can be rewritten in the form of a retrieval constraint as follows:

H1-LNC: If $d_2 = \psi_{AN}(d_1)$ and $\psi_{AN}$ is S-PAN, $f(d_1, q) < f(d_2, q)$ for $K \ge 1$ under the following conditions $C_5$ and $C_6$, for VN-Okapi and VN-DP, respectively:

• $C_5$:
\[
\frac{v(d_1) - v(d_2)}{K} > \frac{b}{1 - b} \cdot \frac{1}{avgs}
\tag{12}
\]

• $C_6$:
\[
\frac{v(d_1) - v(d_2)}{K} > \frac{1}{s(d_1)} \cdot \frac{p(w|C) + p_{ml}(w|d)\, s(d_1)\, \mu^{-1}}{p_{ml}(w|d) - p(w|C)}
\tag{13}
\]

where $p_{ml}(w|d) > p(w|C)$ is additionally assumed in $C_6$.^9

Compared to LNC1, the consequence part of H1-LNC conditionally entails the negation of LNC1, implying that the VN model often prefers some of the scope-broadened documents resulting from S-PAN, although the original model does not.^10
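The role of condition $C_5$ can be checked numerically with a small Python sketch. It assumes that VN-Okapi scores a single query term with the Okapi term-frequency component applied to the pre-normalized frequency $c(w,d)/v(d)$, with the scope $s(d)$ taking the place of the document length and $v(d) = |d|/s(d)$; this form, the helper names, and all numbers below are assumptions for illustration.

```python
def vn_okapi_tf(c, length, scope, b, k1, avgs):
    """Assumed VN-Okapi tf component for one query term."""
    v = length / scope                       # verbosity v(d) = |d| / s(d)
    tf = c / v                               # pre-normalized term frequency
    return tf / (k1 * ((1 - b) + b * scope / avgs) + tf)


def c5_holds(len1, scope1, len2, scope2, b, avgs):
    """Condition C5 (Eq. 12) for d2 obtained from d1 by adding K noise words."""
    k_added = len2 - len1
    v1, v2 = len1 / scope1, len2 / scope2
    return (v1 - v2) / k_added > b / (1 - b) / avgs


def compare(len1, scope1, len2, scope2, b, avgs, c=5, k1=1.2):
    """Return (C5 holds, f(d1) < f(d2)) for one hypothetical S-PAN."""
    f1 = vn_okapi_tf(c, len1, scope1, b, k1, avgs)
    f2 = vn_okapi_tf(c, len2, scope2, b, k1, avgs)
    return c5_holds(len1, scope1, len2, scope2, b, avgs), f1 < f2


# Verbose d1 (v = 5) that gains 20 new unique words, weak normalization (b = 0.3).
case_a = compare(100, 20, 120, 40, b=0.3, avgs=100.0)
# Less verbose d1 (v = 2) that gains 10 new unique words, strong normalization (b = 0.9).
case_b = compare(100, 50, 110, 60, b=0.9, avgs=50.0)
print(case_a, case_b)
```

Under the assumed scoring form, both entries of each pair agree: the scope-broadened document is preferred exactly when $C_5$ holds.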


^9 For $C_6$, assuming an extreme case where the query word is highly topical, i.e., $p_{ml}(w|d) \gg p(w|C)$ (or $r \to \infty$), $C_6$ is simplified as:

\[
\frac{v(d_1) - v(d_2)}{K} > \frac{1}{\mu}
\]

^10 From Eq. (12) and Eq. (13), when $(v(d_1) - v(d_2))/K$ is sufficiently large, $C_5$ (or $C_6$) can be satisfied both for VN-Okapi and VN-DP. This case can appear if $v(d_1)$ is large and $s(d_1)$ is small, but it is not always

• Example of H1: Here, we present examples of S-PAN and V-PAN. Suppose that we use UniqLength as the scope measure and consider a document consisting of passages that are disjoint in scope. Formally, let $g$, $h$, and $x$ be passages, and assume that $g$, $h$, and $x$ have no common or overlapping content, where $g$ denotes a relevant passage and $h$ and $x$ are non-relevant, i.e., $c(w, g) > 0$ and $c(w, h) = c(w, x) = 0$ for query word $w \in q$. Examples of S-PAN and V-PAN are as follows:

Example of S-PAN:
$d_1 = g\,g\,h\,h$
$d_2 = g\,g\,h\,h\,x$

Example of V-PAN:
$d_1 = g\,g\,h\,h$
$d_2 = g\,g\,h\,h\,h$

For both examples, the query-relevant content is not changed after PAN. Because these two examples are PAN examples, the original method always prefers $d_1$ to $d_2$, irrespective of the PAN type. However, $C_1$ is satisfied only for V-PAN, because $s(d_2) = s(d_1)$ and $v(d_2) > v(d_1)$, and not clearly for S-PAN, because $v(d_2) < v(d_1)$ is plausible due to $s(d_2) > s(d_1)$. Therefore, the VN method prefers $d_1$ in V-PAN but not always in S-PAN.
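The two PAN examples above can be reproduced with a few lines of Python. The sketch makes two simplifying assumptions: each passage is treated as a single token, so that UniqLength is the number of distinct passages, and verbosity is taken as $v(d) = |d| / s(d)$. The helper names are introduced here for illustration.

```python
def uniq_length(doc):
    """UniqLength scope: number of distinct units in the document."""
    return len(set(doc))


def verbosity(doc):
    """Assumed verbosity v(d) = |d| / s(d)."""
    return len(doc) / uniq_length(doc)


def c1_holds(d1, d2):
    """Condition C1: v(d2) > v(d1)."""
    return verbosity(d2) > verbosity(d1)


d1 = ["g", "g", "h", "h"]
d2_s_pan = d1 + ["x"]     # S-PAN: the added passage x is new content
d2_v_pan = d1 + ["h"]     # V-PAN: the added passage h is already covered

print(verbosity(d1), verbosity(d2_s_pan), verbosity(d2_v_pan))
print(c1_holds(d1, d2_s_pan), c1_holds(d1, d2_v_pan))
```

In this toy setting the verbosity drops under S-PAN and rises under V-PAN, so $C_1$ holds only for the V-PAN example.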

• Derivation of H1: To show how the VN model behaves differently toward V-PAN and S-PAN, we first rewrite $C_1$ as $s(d_2) - s(d_1) < K / v(d_1)$, implying that the scope of the new document $d_2$ must not increase considerably after PAN is performed.

i) Case: V-PAN

First, V-PAN does not increase the scope according to the definition of a V-type perturbation, resulting in $s(d_2) - s(d_1) \le 0$. As a result, $C_1$ is always true for V-PAN, which in turn makes LNC1 true. Thus, for V-PAN, there is no difference between the original and the VN models in satisfying LNC1.

For example, suppose that we use UniqLength as the scope measure and consider a V-PAN in which all $K$ words already occur in $d_1$. In this case, $s(d_2) = s(d_1)$ because no new words occur in $d_2$. Thus, $C_1$ is equivalent to $s(d_2) - s(d_1) = 0 < K / v(d_1)$, which is true irrespective of $K$.

ii) Case: S-PAN

Second, S-PAN increases the scope after PAN is performed according to the definition of an S-type perturbation, resulting in $s(d_2) - s(d_1) > 0$. Therefore, $C_1$ is not always true.

For example, suppose that we use UniqLength as the scope measure again, and consider an S-PAN in which all $K$ words are new and different from each other. In this case, $s(d_2) = s(d_1) + K$, and $C_1$ is equivalent to $K < K / v(d_1)$; however, $C_1$ is usually not satisfied because $v(d_1) > 1$.

Instead, the VN model often satisfies the negation of LNC1. Consider the same S-PAN example in which all $K$ words are different from each other, and assume that we use VN-DP as an example retrieval model. $C_6$ is then equivalent to:

\[
\frac{v(d_1) - 1}{1 + K / s(d_1)} > \frac{p(w|C) + p_{ml}(w|d)\, s(d_1)\, \mu^{-1}}{p_{ml}(w|d) - p(w|C)}
\]

There exist a number of situations in which $C_6$ is true according to Eq. (13) (i.e., $v(d_1)$ is sufficiently large, or the query word is highly topical ($p_{ml}(w|d) \gg p(w|C)$) and $\mu$


true. Otherwise, $C_5$ (or $C_6$) can be satisfied, depending on the choice of a retrieval parameter value or the term discrimination value of a query word; for VN-Okapi, $C_5$ is satisfied if $b$ is sufficiently small; for VN-DP, $C_6$ is satisfied if $\mu$ is sufficiently large and the query word is highly topical.

is reasonably large). Therefore, for S-PAN, the VN model often does not satisfy LNC1, in contrast to the original model, which always satisfies LNC1 □.

Next, we present the heuristic H2 and discuss its derivation from $C_2$.

4.3.2. H2: Strict penalization of verbosity-increased documents. The VN retrieval method imposes a strict penalization on a verbosity-increased document after performing PLS (from LNC2).

As was done for PAN, we divide PLS into V-PLS and S-PLS. V-PLS denotes verbosity-increasing PLS, where the non-query words after PLS is performed are covered by the original scope of the document, thereby increasing verbosity. S-PLS denotes scope-broadening PLS, where the non-query words after PLS is performed introduce new content that is not covered by the scope of the original document. In terms of V-type and S-type, V-PLS and S-PLS can be defined as follows:

(1) V-PLS: V-PLS is a specific type of PLS, being V-type.

(2) S-PLS: S-PLS is a specific type of PLS, being S-type.

Given two documents $d_1 = \psi_{LS}(d_2)$ and $d_2$ for LNC2, from the definition of $C_2$ (i.e., $s(d_1) > s(d_2)$), LNC2 is satisfied only if the scope of the original document increases after PLS is performed. Therefore, the VN model prefers (or does not penalize) $d_1$ only for S-PLS, because it increases the scope; in contrast, it penalizes $d_1$ for V-PLS, which decreases the scope. As such, the VN model imposes a strict penalization on a verbosity-increased document after PLS.

Equivalently, the heuristic H2 can be rewritten in the form of a retrieval constraint as follows:

H2-LNC: If $d_1 = \psi_{LS}(d_2)$ and $\psi_{LS}$ is V-PLS, $f(d_1, q) < f(d_2, q)$ for $K \ge 1$.

Compared to LNC2, the consequence part of H2-LNC is the negation of that of LNC2, implying that the VN model always penalizes verbosity-increased documents resulting from V-PLS, although the original model does not (i.e., it prefers them).

• Example of H2: We present examples of S-PLS and V-PLS. Suppose that we use UniqLength as the scope measure and consider a document consisting of passages that are disjoint in scope. Formally, let $g$, $h$, $x$, and $y_i$ be passages with equal length (i.e., $|g| = |h| = |x| = |y_i|$) and unit scope (i.e., $s(g) = s(h) = s(x) = s(y_i) = 1$), and assume that $g$, $h$, $x$, and $y_i$ have no common or overlapping content, where $g$ denotes a relevant passage and $h$, $x$, and $y_i$ are non-relevant, i.e., $c(w, g) > 0$ and $c(w, h) = c(w, x) = c(w, y_i) = 0$ for query word $w \in q$. Examples of S-PLS and V-PLS are as follows:

Example of S-PLS:
$d_1 = g\,g\,h\,x\,y_1\,y_2$
$d_2 = g\,h\,x$

Example of V-PLS:
$d_1 = g\,g\,h\,h\,h\,h$
$d_2 = g\,h\,x$

For both examples, $|d_1| = 2|d_2|$ and $c(w, d_1) = 2\,c(w, d_2)$ for $w \in q$, i.e., the query-relevant content is copied twice. The example of S-PLS introduces two new non-relevant passages $y_1$ and $y_2$ that are not given in $d_2$, whereas the example of V-PLS does not introduce any new passage but only repeats the previously mentioned non-relevant passage $h$.

Because these two examples are PLS examples, the original method always prefers $d_1$ to $d_2$, irrespective of the PLS type. However, $C_2$ is satisfied only for S-PLS and not for V-PLS, since $s(d_1) = 5 > s(d_2) = 3$ in S-PLS and $s(d_1) = 2 < s(d_2) = 3$ in V-PLS. Therefore, the VN method prefers $d_1$ only in S-PLS and not in V-PLS.

Table VI. Summary of the normalization behaviors of the original and VN models for four perturbations (S-PAN, V-PAN, S-PLS, and V-PLS) under A1, in which d is the original document, and ψ_AN(d) (or ψ_LS(d)) is the perturbed document of d after PAN (or PLS).

ψ        Verbosity-normalized model                          Original model
         (UniqLength, EntropyPower)
S-PAN    f(d,q) < f(ψ_AN(d),q) if C5 (or C6) is true         f(d,q) ≥ f(ψ_AN(d),q)
         f(d,q) ≥ f(ψ_AN(d),q) otherwise
V-PAN    f(d,q) ≥ f(ψ_AN(d),q)                               f(d,q) ≥ f(ψ_AN(d),q)
S-PLS    f(d,q) ≤ f(ψ_LS(d),q)                               f(d,q) ≤ f(ψ_LS(d),q)
V-PLS    f(d,q) > f(ψ_LS(d),q)                               f(d,q) ≤ f(ψ_LS(d),q)
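The S-PLS and V-PLS examples of H2 can be checked in the same way as the PAN examples. The sketch below again treats each passage as a single token (an assumption), computes the UniqLength scope, tests $C_2$, and reports the pre-normalized frequency $c(w,d)/v(d)$ of the relevant passage, with $v(d) = |d|/s(d)$ assumed. Helper names are hypothetical.

```python
def uniq_length(doc):
    """UniqLength scope: number of distinct units in the document."""
    return len(set(doc))


def pre_normalized_tf(doc, unit):
    """c(w, d) / v(d), with the assumed verbosity v(d) = |d| / s(d)."""
    v = len(doc) / uniq_length(doc)
    return doc.count(unit) / v


def c2_holds(d1, d2):
    """Condition C2: s(d1) > s(d2)."""
    return uniq_length(d1) > uniq_length(d2)


d2 = ["g", "h", "x"]
d1_s_pls = ["g", "g", "h", "x", "y1", "y2"]   # S-PLS: two new passages y1, y2
d1_v_pls = ["g", "g", "h", "h", "h", "h"]     # V-PLS: only h is repeated

for name, d1 in (("S-PLS", d1_s_pls), ("V-PLS", d1_v_pls)):
    print(name, uniq_length(d1), c2_holds(d1, d2), pre_normalized_tf(d1, "g"))
print("d2", uniq_length(d2), pre_normalized_tf(d2, "g"))
```

The scopes come out as 5 and 2 against 3 for $d_2$, matching the values quoted in the text; the pre-normalized frequency of $g$ rises above that of $d_2$ only in the S-PLS case.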


• Derivation of H2: From the definitions of V-type and S-type perturbations, it is trivial to show that $C_2$ is true for S-PLS but false for V-PLS. Therefore, for V-PLS, the VN model does not satisfy LNC2, in contrast to the original model, which always satisfies LNC2 □.

4.3.3. Summary. Table VI summarizes the normalization behaviors of the original and VN models in response to S-PAN, V-PAN, S-PLS, and V-PLS.

For V-PAN and S-PLS, the VN model leads to the same normalization heuristics as those of the original model. For S-PAN and V-PLS, however, the normalization behaviors of the original and the VN models are completely different: for S-PAN, the VN model often does not penalize the new document, whereas the original model always penalizes it; for V-PLS, the VN model always penalizes the new document, whereas the original model does not.

Overall, the normalization heuristics entailed by the VN model depend on whether a perturbation is V-type or S-type. For a V-type perturbation, the VN model imposes a strict penalization on a verbosity-increased document (i.e., entailing H2), irrespective of whether the perturbation is PAN or PLS. For an S-type perturbation, the VN model is unlikely to penalize a scope-broadened document (i.e., entailing H1). In contrast, the normalization heuristics of the original model depend on whether a perturbation is PAN or PLS, not on whether it is V-type or S-type. For PAN, the original model penalizes the new document, irrespective of whether the document is verbosity-increased or scope-broadened. For PLS, it does not penalize the new document.

5. EXPERIMENTAL SETTING 
5.1. Experimental Setup 

For evaluation, we used three standard TREC collections: ROBUST, WT10G, and GOV2. Table VII lists the basic statistics of each test collection, where NumDocs is the number of documents, NumWords is the total number of word occurrences in each collection, TopicSet is the range of topic numbers used for training and testing, and Avg of |d|, h(d), and v(d) indicates the average length, entropy power, and verbosity,^11 respectively, in a given collection. CoeffVar is the corresponding coefficient of variation, which is defined as the ratio of the standard deviation to the mean. The interesting statistic is the CoeffVar of v(d), which indicates how much the verbosities of documents differ within a collection. ROBUST has the most similar verbosities across documents, whereas GOV2 has the most different verbosities. This is because many documents in ROBUST are newspaper documents, for example, from the Financial Times and the Los Angeles Times, which are more homogeneous collections. In contrast, the web documents in GOV2 are more heterogeneous.

All experiments were performed using the Lemur toolkit (version 4.12). We carried 
out standard preprocessing by applying the Porter stemmer and removing stopwords 


^11 Entropy power is used as the scope measure for v(d).

Table VII. Statistics of each test collection. CoeffVar denotes the coefficient of variation.

Statistics       ROBUST       WT10G        GOV2
NumDocs          528,156      1,692,096    25,205,179
NumWords         572,180      6,346,858    40,002,579
TopicSet         Q301-450     Q451-550     Q701-850
                 Q601-700
Avg of |d|       233.34       400.25       690.8
(CoeffVar)       (2.39)       (6.06)       (2.86)
Avg of h(d)      107.77       109.60       109.85
(CoeffVar)       (0.81)       (1.45)       (0.98)
Avg of v(d)      1.77         2.95         6.11
(CoeffVar)       (0.91)       (5.51)       (7.17)


from the standard INQUERY stoplist [Allan et al. 2000]. To cover different types of queries, we follow the setting used in [Zhai and Lafferty 2001], where four combinations are used: short keywords (sk, title), short verbose (sv, description), long keywords (lk, concept), and long verbose (lv, title, description, and narrative). Because lk is not available in our test topic sets, the other three types were examined. We use MAP (mean average precision) and P@5 (precision at the top 5 documents) as the evaluation measures [Croft et al. 2009].

For each query, our evaluation is based on the top 1,000 retrieved documents. We also report significance test results obtained by a non-directional paired t-test at the 0.95 confidence level. For the significance test, we use all per-topic performances in a collection, i.e., the number of performance difference samples used for the t-test is the same as the total number of topics in a given collection.^12
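The paired t-test over per-topic performances can be sketched with the Python standard library alone. The per-topic average-precision values below are hypothetical, and the critical value 2.776 is the standard two-sided 0.05 threshold for 4 degrees of freedom taken from t tables; a real evaluation would use one pair per topic in the collection and a statistics package to obtain the p-value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, system):
    """Return (t, degrees of freedom) for paired per-topic scores."""
    diffs = [s - b for b, s in zip(baseline, system)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)   # sample std. dev. of the differences
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-topic average precision for five topics.
baseline_ap = [0.20, 0.35, 0.10, 0.50, 0.28]
vn_ap = [0.24, 0.36, 0.15, 0.52, 0.30]

t_value, dof = paired_t_statistic(baseline_ap, vn_ap)
T_CRITICAL_DF4 = 2.776            # two-sided, alpha = 0.05, 4 degrees of freedom
significant = abs(t_value) > T_CRITICAL_DF4
print(round(t_value, 3), dof, significant)
```

Because the test is non-directional, the absolute value of the statistic is compared with the critical value.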


5.2. Parameter Tuning 

Several tuning parameters are present in the retrieval methods: μ for DP, and b, k1, and k3 for Okapi. Given a test topic set consisting of 50 queries, each parameter was tuned using the other topic sets in the same test collection as the development set.^13 The search space for each parameter is given as follows:

• μ: {100, 200, 300, 400, 500, 600, 800, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 7000, 10000, 15000, 20000}

• b: {0, 0.001, 0.003, 0.005, 0.007, 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}


^12 In Section 5.2, we introduce a K-fold cross-validation to avoid optimizing the retrieval parameters on the test set. However, note that we do not use per-fold performances for the significance test but simply use all per-topic performances. To the best of our knowledge, this type of significance test is an IR-specific setting that differs from the types of significance tests used in the non-IR literature.

^13 Here, a topic set consists of 50 queries, which were created in each year of TREC. For example, in ROBUST, as 250 queries are available, there are five topic sets, namely, TREC6 (Q301-Q350), TREC7 (Q351-Q400), TREC8 (Q401-Q450), ROBUST03 (Q601-Q650), and ROBUST04 (Q651-Q700). Parameters used when testing the 50 queries in each topic set are trained using the other 200 queries in the other topic sets as training data. In other words, for testing the 50 queries in TREC6, queries Q351-Q450 and Q601-Q700 in TREC7, TREC8, ROBUST03, and ROBUST04 are used as training data. For testing queries in TREC7, queries in TREC6, TREC8, ROBUST03, and ROBUST04 are used as the training set, and so on. Therefore, for ROBUST, we use a five-fold cross-validation for parameter tuning, whose folds are fixed. For WT10G, where 100 queries are used, we have two topic sets, namely TREC9 (Q451-Q500) and TREC10 (Q501-Q550). For testing the 50 queries in TREC9, we use the queries in TREC10 as the training set, and vice versa. Thus, for WT10G, we use a two-fold cross-validation for parameter tuning. Similarly, for GOV2, 150 queries are available, so we have three topic sets, namely TREC2004 (Q701-Q750), TREC2005 (Q751-Q800), and TREC2006 (Q801-Q850). For testing the 50 queries in TREC2004, the queries in TREC2005 and TREC2006 are used as the training data. Thus, for GOV2, we use a three-fold cross-validation for parameter tuning.

Table VIII. MAP performance comparison of DP and VN-DP on three collections (ROBUST, WT10G, and GOV2), three different query types (sk, sv, and lv), and three different scope measures (LengthPower(β), UniqLength, and EntropyPower). The row titled "baseline" indicates the original model. The symbol * indicates that a run of the VN method shows a statistically significant improvement over the baseline in the t-test at the 0.95 confidence level.

Query   Method DP (or VN-DP)    ROBUST     WT10G      GOV2
sk      baseline                0.2447     0.1963     0.2907
        LengthPower(0.25)       0.2252     0.1649     0.2403
        LengthPower(0.5)        0.2401     0.1953     0.2823
        LengthPower(0.75)       0.2457     0.1968     0.2930
        LengthPower(0.9)        0.2460*    0.1963     0.2913
        UniqLength              0.2472*    0.2046     0.3055*
        EntropyPower            0.2481*    0.2120*    0.3099*
sv      baseline                0.2260     0.1909     0.2455
        LengthPower(0.25)       0.2312     0.1790     0.2350
        LengthPower(0.5)        0.2443*    0.2103*    0.2633*
        LengthPower(0.75)       0.2396*    0.2044*    0.2569*
        LengthPower(0.9)        0.2319*    0.1946*    0.2487*
        UniqLength              0.2385*    0.2109*    0.2671*
        EntropyPower            0.2440*    0.2196*    0.2826*
lv      baseline                0.2707     0.2469     0.2864
        LengthPower(0.25)       0.2697     0.2249     0.3060*
        LengthPower(0.5)        0.2765*    0.250(5    0.3133*
        LengthPower(0.75)       0.2762*    0.2532*    0.3005*
        LengthPower(0.9)        0.2725*    0.2501*    0.2914*
        UniqLength              0.2759*    0.2553*    0.3083*
        EntropyPower            0.2799*    0.2614*    0.3248*

• k1: {0.25, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.5, 3.0}

• k3: fixed at 1,000

In our preliminary experiments, we found that LengthPower for $s(d)$ can suffer from a parameter scaling problem, in which the optimal parameter ranges of $\mu$ and $k_1$ in the VN methods differ from the known ranges of the original model. For instance, when $\beta = 0.25$, $\mu$ was found to be optimal at a value of less than 100, which is outside the normal parameter range. To resolve the scaling problem, we substitute $avgv$ for $k$, instead of setting $k$ to 1, such that $c(w, \phi(d))$ would become $c(w, d)$ on average. This consideration leads to the following parameter scaling:

\[
k_1 \leftarrow k_1 \cdot avgv^{-1}, \qquad \mu \leftarrow \mu \cdot avgv^{-1}
\]

This parameter scaling is applied only to LengthPower, and not to the others. No such parameter scaling problem occurs in the case of UniqLength and EntropyPower.
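The equivalence behind this parameter scaling can be verified numerically. The Python sketch below assumes that pre-normalization maps $c(w,d)$ to $k \cdot c(w,d)/v(d)$, so that the pre-normalized document length is $k \cdot s(d)$ and its collection average is $k \cdot avgs$; it also uses a simplified Okapi term-frequency component and a single-term Dirichlet-prior score. These forms and all numeric values are assumptions chosen for illustration.

```python
import math


def okapi_tf(tf, k1, length_ratio, b=0.75):
    """Simplified Okapi tf component: tf / (k1 * ((1 - b) + b * ratio) + tf)."""
    return tf / (k1 * ((1 - b) + b * length_ratio) + tf)


def dp_score(tf, length, mu, p_wc):
    """Single-term Dirichlet-prior score log((tf + mu * p(w|C)) / (length + mu))."""
    return math.log((tf + mu * p_wc) / (length + mu))


# Hypothetical document statistics and parameters.
c, v, s, avgs, avgv = 4.0, 2.5, 80.0, 100.0, 3.0
k1, mu, p_wc = 1.2, 1000.0, 0.001

# (a) k = avgv with the original parameters k1 and mu.
tf_a, len_a = avgv * c / v, avgv * s
okapi_a = okapi_tf(tf_a, k1, len_a / (avgv * avgs))
dp_a = dp_score(tf_a, len_a, mu, p_wc)

# (b) k = 1 with the rescaled parameters k1 / avgv and mu / avgv.
tf_b, len_b = c / v, s
okapi_b = okapi_tf(tf_b, k1 / avgv, len_b / avgs)
dp_b = dp_score(tf_b, len_b, mu / avgv, p_wc)

print(math.isclose(okapi_a, okapi_b), math.isclose(dp_a, dp_b))
```

Under these assumptions both settings give identical scores, which suggests why choosing $k = avgv$ brings the optimal $k_1$ and $\mu$ back to their usual ranges.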

6. EXPERIMENTAL RESULTS 

This section reports the comparative results of the original and the VN retrieval methods for Okapi, DP, and MRF.

6.1. DP vs. VN-DP

Table VIII shows the comparative results (MAP) of DP and VN-DP under three different scope measures, LengthPower(β), UniqLength, and EntropyPower, which are denoted as $l_{\beta}(d)$, $u(d)$, and $h(d)$, respectively.

Generally, it is observed that VN-DP improves over the original DP. These improvements are statistically significant for almost all test collections and all query types (for both




Table IX. P@5 performance comparison of DP and VN-DP for three collections (ROBUST, WT10G, and GOV2).

Query   Method DP (or VN-DP)    ROBUST     WT10G      GOV2
sk      baseline                0.4924     0.3120     0.5678
        LengthPower(0.5)        0.4707     0.3360     0.5409
        LengthPower(0.75)       0.4851     0.3220     0.5611
        LengthPower(0.9)        0.4924     0.3080     0.5691
        UniqLength              0.4956     0.3620*    0.5906
        EntropyPower            0.4972     0.3640*    0.6416*
sv      baseline                0.4466     0.3880     0.5208
        LengthPower(0.5)        0.4811*    0.4000     0.5383
        LengthPower(0.75)       0.4699*    0.3960     0.5409
        LengthPower(0.9)        0.4530     0.3820     0.5275
        UniqLength              0.4755*    0.4060     0.5826*
        EntropyPower            0.4932*    0.4300*    0.6309*
lv      baseline                0.5414     0.4460     0.6228
        LengthPower(0.5)        0.5518     0.4660     0.6188
        LengthPower(0.75)       0.5510     0.4560     0.6295
        LengthPower(0.9)        0.5526*    0.4520     0.6282
        UniqLength              0.5542*    0.4560     0.6456*
        EntropyPower            0.5631*    0.4700     0.6644*


UniqLength and EntropyPower), often resulting in an improvement of 10%. The improvement tends to be larger on the Web collections (i.e., WT10G and GOV2) than on ROBUST. A possible reason is that the Web collections have a higher CoeffVar of v(d) because of the heterogeneity of their documents, and thus, they could gain more from our verbosity normalization.

Among the three scope measures, EntropyPower is the best, and it outperforms UniqLength and LengthPower for most topic sets. UniqLength is slightly better than LengthPower; however, the difference in their performances is not significant. When the best β value is adopted for each query type, LengthPower can often show performance similar to that of UniqLength.

Interestingly, VN-DP leads to significant improvements, more so for verbose queries (i.e., sv and lv) than for keyword queries. For EntropyPower, VN-DP yields an improvement of 1.55% in ROBUST, 8.2% in WT10G, and 6.64% in GOV2 for short keyword queries. The corresponding improvements are much larger for short verbose queries, being 8.05% in ROBUST, 16.66% in WT10G, and 15.11% in GOV2, and they are also large for long verbose queries. Restricting our discussion to DP, these results strongly support the view that the use of heuristics H1 and H2 is indeed important, especially for verbose queries.

In addition, Table IX shows the P@5 performance of VN-DP, as compared to that of DP, based on the MAP-optimized free-parameter values used in Table VIII. One reason for using the same retrieval parameters instead of directly optimizing P@5 is that the performance curve was smoother for MAP than for P@5, thereby avoiding the use of far-from-optimal parameter values. As similarly mentioned by [Kurland and Lee 2009], this choice helps us to examine whether the improved performance in MAP causes severe degradation or significant improvement in the precision, which is often an important metric in some IR applications such as Web search.

Under EntropyPower, the improvement of P@5 from DP to VN-DP is significant in
most cases, often being larger than that of MAP for short keyword and verbose queries.
VN-DP improves over DP by about 16.67% on WT10G and about 13.00% on GOV2 for
short keyword queries, and 13.00% on WT10G and 21.14% on GOV2 for short verbose
queries. Exceptional cases are found for keyword queries on ROBUST and for long
verbose queries on WT10G, where the improvement in the precision is not statistically

Table X. MAP performance comparison of Okapi and VN-Okapi for three collections,
ROBUST, WT10G, and GOV2. The symbol * indicates that a run of the VN method shows
a statistically significant improvement over the baseline in the t-test at the 0.95
confidence level.

Query  Method (Okapi or VN-Okapi)   ROBUST     WT10G      GOV2
sk     baseline                     0.2444     0.1946     0.2920
sk     LengthPower(0.5)             0.2451     0.1957     0.2897
sk     LengthPower(0.75)            0.2454     0.1994*    0.2923
sk     LengthPower(0.9)             0.2452     0.1944     0.2923
sk     UniqLength                   0.2483*    0.1997     0.3035*
sk     EntropyPower                 0.2477*    0.2071*    0.3004*
sv     baseline                     0.2247     0.1853     0.2498
sv     LengthPower(0.5)             0.2279*    0.1806     0.2527
sv     LengthPower(0.75)            0.2263     0.1872     0.2530*
sv     LengthPower(0.9)             0.2271*    0.1878     0.2529*
sv     UniqLength                   0.2267     0.1936     0.2607*
sv     EntropyPower                 0.2303*    0.1968*    0.2599*
lv     baseline                     0.2619     0.2344     0.3012
lv     LengthPower(0.5)             0.2647*    0.2307     0.3022
lv     LengthPower(0.75)            0.2640*    0.2341     0.3009
lv     LengthPower(0.9)             0.2631     0.2366     0.3018
lv     UniqLength                   0.2663*    0.2368*    0.3063*
lv     EntropyPower                 0.2659*    0.2415*    0.3074*

significant. Therefore, at least for EntropyPower, the results imply that the significant 
improvements of MAP using VN-DP are caused by the increased performance in P@5. 
For UniqLength, however, the impact of P@5 and its contribution to MAP is less clear 
than that for EntropyPower, although some noticeable improvements are observed. For 
LengthPower, most improvements of P@5 are not statistically significant. This implies 
that performance metrics other than P@5, such as recall, might be the major factors 
causing the significant improvement in MAP for LengthPower. 

For further comparison, Figure 1 shows the performance curves of the original DP
and VN-DP using EntropyPower, plotted by varying μ for MAP and P@5. For both
measures, VN-DP is better than DP for almost all μ values and in all test
collections and query types. The shapes of the curves of DP and VN-DP are similar,
and the optimal ranges of μ are also fairly similar. This similarity between DP and
VN-DP is also observed in the case of UniqLength, although we do not present the
curves for this scope measure here, in the interest of conciseness.


6.2. Okapi vs. VN-Okapi 

Table X shows the comparative results (MAP) of Okapi and VN-Okapi under three
different scope measures: UniqLength, EntropyPower, and LengthPower(β).

As the results show, VN-Okapi gives improvements; however, the magnitude of these
improvements is smaller than that in the case of VN-DP. One possible reason for the
smaller improvement is the approximate two-stage normalization carried out in Okapi,
as discussed in Section 2. As such, a form of verbosity normalization is performed by
Okapi, to some extent, using the component tfn/(tfn + 1), which causes VN-Okapi to
have only a limited effect on retrieval performance.
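
The damping effect of the component tfn/(tfn + 1) can be illustrated with a small sketch. The function name `saturation` is an assumption introduced here for illustration; only the ratio itself comes from the text.

```python
def saturation(tfn: float) -> float:
    """Saturating term-frequency component tfn / (tfn + 1).

    The value grows with tfn but is bounded above by 1, so additional
    occurrences of a term contribute less and less to the score.
    """
    return tfn / (tfn + 1.0)


# Doubling a small normalized frequency changes the component noticeably,
# while doubling a large one barely changes it.
gain_small = saturation(2.0) - saturation(1.0)    # about 0.167
gain_large = saturation(20.0) - saturation(10.0)  # about 0.043
```

Because repeated occurrences in a verbose document add only a small amount once tfn is large, part of the verbosity penalty is already built into Okapi, which is consistent with the smaller gains observed for VN-Okapi.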

Unlike in the case of DP, there is no significant difference between improvements
for short keyword queries and verbose queries. Therefore, the argument made for DP
wherein H2 is important, particularly for verbose queries, is much weaker for Okapi.
Again, this is because Okapi has its own component tfn/(tfn + 1) that performs a form
of normalization of verbose documents. As such, excessive preference for a verbose

Fig. 1. Performance curves (MAP and P@5) of DP and VN-DP obtained using EntropyPower with varying
μ, in ROBUST (top), WT10G (center), and GOV2 (bottom).


document is handled to some extent by the original model, even without our explicit 
verbosity normalization. 

The comparison results for three scope measures are also somewhat different from 
those of DP. In VN-Okapi, there is no winning scope measure between EntropyPower 
and UniqLength; in most cases, both have similar performance. 

Table XI. Comparison of the best performance results for Okapi and VN-Okapi using
EntropyPower, and the corresponding parameter values (b, k1). The symbol * indicates
that a run of VN-Okapi shows a statistically significant improvement over the baseline
in the t-test at the 0.95 confidence level.

Query  Method     ROBUST                 WT10G                  GOV2
sk     Okapi      0.2454  (0.3, 0.6)     0.2033 (0.3, 0.6)      0.2920  (0.01, 0.5)
sk     VN-Okapi   0.2482* (0.1, 0.5)     0.2107 (0.05, 0.3)     0.3018* (0.1, 0.3)
sv     Okapi      0.2267  (0.5, 1.0)     0.1935 (0.5, 2.0)      0.2515  (0.02, 0.6)
sv     VN-Okapi   0.2303* (0.3, 0.6)     0.2001 (0.3, 1.5)      0.2618* (0.2, 0.6)
lv     Okapi      0.2637  (0.8, 0.8)     0.2385 (0.5, 1.5)      0.3012  (0.03, 0.5)
lv     VN-Okapi   0.2676* (0.4, 0.6)     0.2482 (0.3, 0.8)      0.3074* (0.3, 0.5)

Despite the limited effects, the improvements obtained by VN-Okapi are statistically 
significant, at least for either UniqLength or EntropyPower, for most of the collections 
and query types, and thus, they indicate the merit of our two-stage normalization. 

For further comparison, Table XI presents the best MAPs for Okapi and VN-Okapi
and their corresponding parameter values of b and k1 for all three test collections and
three query types.

A comparison of the optimal ranges of b across collections for both methods indicates
that VN-Okapi tends to be robust, without significant differences across collections,
whereas Okapi has poor robustness, with the optimal values of b differing between
GOV2 and the other collections. More specifically, for Okapi, the performance surfaces
on GOV2 are shifted in the decreasing direction of b, relative to those of the other
collections. As a result, the best-performing values of b become much smaller on GOV2
than on the other collections for each query type; for short keyword queries, the best
value of b is 0.01 on GOV2, which is smaller than the value of 0.3 on the other
collections; a similar difference is observed for verbose queries. In contrast, for
VN-Okapi, the parameter sensitivity of b on GOV2 is highly similar to that on the other
collections. The best-performing values of b do not differ much across collections; they
are commonly between 0.05 and 0.1 for short keyword queries, between 0.2 and 0.3 for
short verbose queries, and between 0.3 and 0.4 for long verbose queries.

A comparison of the best performances indicates that VN-Okapi is slightly better
than Okapi, although the increase in MAP is small in magnitude. Despite its small
magnitude, on ROBUST and GOV2, the improvements of VN-Okapi over Okapi are
statistically significant for all three types of queries.

6.3. MRF vs. VN-MRF 

For evaluating MRF and VN-MRF, because we adopt sequential dependence, a
dependency link (undirected link) is inserted only between two adjacent query words.
Unlike in the case of other query types, for a long verbose query, we do not put a
dependency across different topic fields. Thus, no dependency appears between a query
word in the title field and a query word in the description or the narrative fields.
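
The construction of dependency links described above can be sketched as follows. The function name `sequential_links` and the field-to-words dictionary are illustrative assumptions; the sketch only encodes the two stated rules (links between adjacent words, none across topic fields).

```python
from typing import Dict, List, Tuple


def sequential_links(fields: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return undirected dependency links between adjacent query words.

    Each topic field (e.g. title, description, narrative) is processed on
    its own, so no link ever connects words from two different fields.
    """
    links: List[Tuple[str, str]] = []
    for words in fields.values():
        # Pair each word with its immediate successor inside the same field.
        links.extend(zip(words, words[1:]))
    return links


# Hypothetical long verbose query with two fields.
example = sequential_links({
    "title": ["ocean", "pollution"],
    "description": ["effects", "of", "oil", "spills"],
})
```

For a short keyword query there is a single field, so the function reduces to linking every pair of neighbouring words.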

Table XII shows the comparative results of MRF and VN-MRF under three different
scope measures, relative to those of DP and VN-DP using EntropyPower.

It is clearly seen that MRF is always better than DP, with all of the performance
improvements being statistically significant. This precisely reproduces the comparison
results reported by the existing works on MRF [Metzler and Croft 2005]. Note that

Table XII. MAP performance comparison of MRF and VN-MRF on three collections,
ROBUST, WT10G, and GOV2, relative to that of DP and VN-DP. The symbols α, β, and γ
indicate that a run of the VN method shows a statistically significant improvement in
the t-test at the 0.95 confidence level, over DP, VN-DP, and MRF, respectively.
† one or more further significance symbols at this position are illegible in the source.

Query  Method (MRF or VN-MRF)   ROBUST      WT10G       GOV2
sk     baseline (DP)            0.2447      0.1963      0.2907
sk     baseline (VN-DP)         0.2481α     0.2120α     0.3099α
sk     baseline (MRF)           0.2545α†    0.2149α     0.3095α
sk     LengthPower(0.5)         0.2506      0.2055      0.3032
sk     LengthPower(0.75)        0.2557α†    0.2128α     0.3133α†
sk     LengthPower(0.9)         0.2545α†    0.2142α     0.3125α†
sk     UniqLength               0.2572α†    0.2244α†    0.3270α†
sk     EntropyPower             0.2581α†    0.2296α†    0.3334α†
sv     baseline (DP)            0.2260      0.1909      0.2455
sv     baseline (VN-DP)         0.2440α     0.2196α     0.2826α†
sv     baseline (MRF)           0.2416α     0.2063α     0.2687α
sv     LengthPower(0.5)         0.2545α†    0.2197α     0.2810α†
sv     LengthPower(0.75)        0.2507α†    0.2147α†    0.2782α†
sv     LengthPower(0.9)         0.2458α†    0.2125α†    0.2739†
sv     UniqLength               0.2500α†    0.2214α†    0.2879α†
sv     EntropyPower             0.2550α†    0.2368α†    0.2975α†
lv     baseline (DP)            0.2707      0.2469      0.2864
lv     baseline (VN-DP)         0.2799α     0.2614α     0.3248α
lv     baseline (MRF)           0.2813α     0.2613α     0.3164α
lv     LengthPower(0.5)         0.2866α†    0.2581      0.3368α†
lv     LengthPower(0.75)        0.2883α†    0.2659α     0.3280α†
lv     LengthPower(0.9)         0.2861α†    0.2617α     0.3214α†
lv     UniqLength               0.2895α†    0.2687α†    0.3363α†
lv     EntropyPower             0.2927α†    0.2754α†    0.3481α†

the improved performance using MRF is further enhanced by VN-MRF with the
application of the two-stage normalization, and the additional improvements are
statistically significant. In particular, with either UniqLength or EntropyPower,
VN-MRF is always better than MRF for all test collections and all query types, with all
improvements being statistically significant.

Interestingly, VN-DP alone, without exploiting the term dependency, is nearly
comparable to MRF, often even showing better performance. Again, the performance of
VN-DP is further increased by VN-MRF along with the utilization of the term
dependency, and the additional improvements are statistically significant in most cases,
at least using either EntropyPower or UniqLength. Therefore, this result strongly implies
that the effects resulting from the term dependency and from the two-stage normalization
are only weakly correlated, so that their combined utilization yields an incremental
increase.

Another interesting result is that the performance difference of VN-MRF across
test collections and query types shows a highly similar tendency to that of VN-DP.
First, both VN methods (VN-MRF and VN-DP) are more effective on the
heterogeneous Web collections (WT10G and GOV2) than on ROBUST. Second, on
EntropyPower, both VN methods show larger improvements for verbose queries than for
keyword queries; the only exception is found in VN-MRF for long verbose queries
on WT10G, where the improvement is slightly smaller than that for short keyword
queries. Third, on LengthPower, both VN methods often show improvements over their
original methods, and they are more effective for verbose queries than for keyword
queries.

Table XIII. Comparison of P@5 performance of MRF and VN-MRF for three collections,
ROBUST, WT10G, and GOV2, relative to that of DP and VN-DP. The symbols α, β, and γ
indicate that a run of the VN method shows a statistically significant improvement in
the t-test at the 0.95 confidence level, over DP, VN-DP, and MRF, respectively.
† one or more further significance symbols at this position are illegible in the source.

Query  Method (MRF or VN-MRF)   ROBUST      WT10G       GOV2
sk     baseline (DP)            0.4924      0.3120      0.5678
sk     baseline (VN-DP)         0.4972      0.3640α     0.6416
sk     baseline (MRF)           0.5036      0.3580α     0.6121
sk     LengthPower(0.5)         0.4859      0.3540α     0.5664
sk     LengthPower(0.75)        0.4916      0.3500α     0.6054α
sk     LengthPower(0.9)         0.5004      0.3540α     0.6121α
sk     UniqLength               0.5068      0.3660α     0.6470α†
sk     EntropyPower             0.5012      0.3840α†    0.6685α†
sv     baseline (DP)            0.4466      0.3880      0.5208
sv     baseline (VN-DP)         0.4932α     0.4300α     0.6309
sv     baseline (MRF)           0.4876α     0.4240α     0.5839
sv     LengthPower(0.5)         0.4916α     0.4140      0.5544
sv     LengthPower(0.75)        0.4972α     0.4160      0.5785α
sv     LengthPower(0.9)         0.4892α     0.4120      0.5812α
sv     UniqLength               0.4940α     0.4320α     0.5919α†
sv     EntropyPower             0.5044α†    0.4400α     0.6376α†
lv     baseline (DP)            0.5414      0.4460      0.6228
lv     baseline (VN-DP)         0.5631α     0.4700      0.6644α
lv     baseline (MRF)           0.5598α     0.4700α     0.6550α
lv     LengthPower(0.5)         0.5582      0.4680      0.6336
lv     LengthPower(0.75)        0.5639α     0.4720α     [illegible]
lv     LengthPower(0.9)         0.5655α     0.4580      0.6456
lv     UniqLength               0.5751α†    0.4760α     0.6671α
lv     EntropyPower             0.5799α†    0.4640      0.6886α†

The similarity between the two VN methods is understandable considering the fact
that the underlying retrieval function in MRF is basically the same as that of DP:
DP and MRF commonly employ the smoothed document model of Eq. (6) for scoring a
document.

Table XIII shows the P@5 performance of VN-MRF, in comparison to that of
MRF, using the values of the MAP-optimized parameters. The results for P@5 are
largely similar to those for MAP, as seen in Table XII. In many cases, the P@5
performance of MRF is better than that of DP, often with statistically significant
improvements. Further, the performance of MRF is increased by VN-MRF with two-stage
normalization, at least using either UniqLength or EntropyPower, and often with
statistically significant improvements. In the particular case using EntropyPower, VN-MRF
yields statistically significant improvements over MRF on WT10G and GOV2 for all
short keyword queries and on ROBUST and GOV2 for some verbose queries. This
result implies that in many cases, VN-MRF's significant improvement in MAP results
from the increased performance of P@5. VN-DP alone shows performance similar to
that of MRF. Again, the precision of VN-DP is slightly increased by VN-MRF exploiting
the term dependency, although usually not to a statistically significant degree,
unlike the results in MAP.

For further comparison, Figure 2 shows the performance curves of MRF and VN-MRF,
plotted by varying μ, for short keyword and verbose queries; EntropyPower is
used as the scope measure, and MAP and P@5 are used as the evaluation measures.
Again, there is a great degree of similarity between the comparison results of VN-MRF
and MRF and those of VN-DP and DP: for the P@5 curves, VN-MRF is always better than
the original method, except for only a few parameter values of μ in ROBUST. The
shapes of the performance curves of P@5 are quite similar for both VN-MRF and MRF.
The optimal ranges of μ are also close, as in the case of VN-DP and DP.




Fig. 2. Performance curves (MAP and P@5) of MRF and VN-MRF obtained using EntropyPower with
varying μ on ROBUST (top), WT10G (center), and GOV2 (bottom). Panels: (a) ROBUST (MAP),
(b) ROBUST (P@5), (c) WT10G (MAP), (d) WT10G (P@5), (e) GOV2 (MAP).

7. APPLICATION TO LOWER BOUNDING TERM FREQUENCY NORMALIZATION
7.1. Lower-Bounded Retrieval Models 

As discussed in related works, [Lv and Zhai 2011b] recently proposed the use of
lower-bounding term frequency normalization to avoid over-penalization of very long
documents. Their experimental results showed that lower-bounded retrieval models lead
to significant improvements in comparison with baseline models. An interesting issue
is whether our proposed two-stage normalization can further improve these
lower-bounded models. To this end, we chose DP and VN-DP as retrieval models, and
compared their lower-bounded models with their VN models.

Two lower-bounded models for DP and VN-DP are presented in the following. First,
a lower-bounded model for DP can be formulated as follows:

\[
\sum_{w \in q \cap d} c(w,q) \left[ \ln\left(1 + \frac{c(w,d)}{\mu P(w|C)}\right)
  + \ln\left(1 + \frac{\delta}{\mu P(w|C)}\right) \right]
  + |q| \ln\frac{\mu}{|d| + \mu}
\tag{16}
\]

which is called DP+. In Eq. (16), δ is a pseudo term frequency value that controls the
scale of the lower bound, which was introduced by [Lv and Zhai 2011b].

Second, a lower-bounded model for VN-DP can be formulated by straightforwardly
applying the general normalization approach of [Lv and Zhai 2011b]:

\[
\sum_{w \in q \cap d} c(w,q) \left[ \ln\left(1 + \frac{c(w,d)\, s(d)}{\mu P(w|C)\, |d|}\right)
  + \ln\left(1 + \frac{\delta}{\mu P(w|C)}\right) \right]
  + |q| \ln\frac{\mu}{s(d) + \mu}
\tag{17}
\]

which is called VN-DP+.
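
The two lower-bounded scoring functions can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function and argument names are hypothetical, the collection model P(w|C) is passed in as a plain dictionary, and the verbosity is taken to be v(d) = |d| / s(d), which is the relation implied by Eq. (19).

```python
import math
from typing import Dict


def dp_plus_score(query_tf: Dict[str, float], doc_tf: Dict[str, float],
                  doc_len: float, coll_prob: Dict[str, float],
                  mu: float, delta: float) -> float:
    """Lower-bounded Dirichlet-prior score (DP+) over matched query terms."""
    score = 0.0
    for w, qtf in query_tf.items():
        c = doc_tf.get(w, 0.0)
        if c <= 0.0:
            continue  # only terms in both the query and the document contribute
        p = coll_prob[w]
        score += qtf * (math.log(1.0 + c / (mu * p))
                        + math.log(1.0 + delta / (mu * p)))
    query_len = sum(query_tf.values())
    return score + query_len * math.log(mu / (doc_len + mu))


def vn_dp_plus_score(query_tf: Dict[str, float], doc_tf: Dict[str, float],
                     doc_len: float, scope: float,
                     coll_prob: Dict[str, float],
                     mu: float, delta: float) -> float:
    """VN-DP+: apply DP+ to the pre-normalized document.

    Term frequencies are divided by the verbosity v(d) = |d| / s(d)
    (assumed relation), and the scope s(d) replaces the document length.
    """
    verbosity = doc_len / scope
    normalized_tf = {w: c / verbosity for w, c in doc_tf.items()}
    return dp_plus_score(query_tf, normalized_tf, scope, coll_prob, mu, delta)
```

Under these assumptions, a document that repeats every term twice (doubling |d| while keeping s(d) fixed) receives the same VN-DP+ score as the original, whereas its DP+ score changes.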

Similarly, we can derive a lower-bounded Okapi (Okapi+) and a lower-bounded
VN-Okapi (VN-Okapi+). In the original BM25 retrieval formula, Okapi+ uses
tf_BM25+(w, d) (i.e., tf_BM25(w, d) + δ) for the term frequency component, and VN-Okapi+
uses tf_BM25+(w, φ(d)) (i.e., tf_BM25(w, φ(d)) + δ), which are given by

\[
tf_{BM25+}(w, d) = \frac{(k_1 + 1)\, c(w,d)}{k_1 \left((1 - b) + b\,|d| / avgl\right) + c(w,d)} + \delta
\tag{18}
\]

\[
tf_{BM25+}(w, \phi(d)) = \frac{(k_1 + 1)\, c(w,d)}{k_1 |d| \left((1 - b)/s(d) + b/avgs\right) + c(w,d)} + \delta
\tag{19}
\]
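
The two term frequency components can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and the additive δ follows the textual definition "tf_BM25 + δ" given above.

```python
def tf_bm25_plus(c: float, doc_len: float, avg_len: float,
                 k1: float, b: float, delta: float) -> float:
    """Lower-bounded BM25 term frequency component for a raw document."""
    return (k1 + 1.0) * c / (k1 * ((1.0 - b) + b * doc_len / avg_len) + c) + delta


def tf_vn_bm25_plus(c: float, doc_len: float, scope: float, avg_scope: float,
                    k1: float, b: float, delta: float) -> float:
    """Lower-bounded BM25 component for the verbosity-normalized document.

    Algebraically this is the BM25 component applied to c / v(d) with length
    s(d), where v(d) = |d| / s(d), after multiplying numerator and
    denominator by v(d).
    """
    denom = k1 * doc_len * ((1.0 - b) / scope + b / avg_scope) + c
    return (k1 + 1.0) * c / denom + delta
```

When s(d) equals |d| and the average scope equals the average length, the verbosity is 1 and the two components coincide, which gives a quick sanity check of the algebra.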


7.2. Experiment Results 

For parameter tuning of the lower-bounded models, we follow the setting in the work
of [Lv and Zhai 2011b]: For DP+ and VN-DP+, we search δ over the space between 0
and 0.15, with increments of 0.01. For Okapi+ and VN-Okapi+, we search δ over the space
between 0.0 and 1.5, with increments of 0.1. The search spaces for other retrieval
parameters are the same as those used in previous sections.
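
The two search spaces for δ can be enumerated as below. The helper names and the `evaluate_map` callback are hypothetical placeholders for whatever procedure computes MAP on the training topics; only the ranges and step sizes come from the text.

```python
from typing import Callable, List


def delta_grid(upper: float, step: float) -> List[float]:
    """Evenly spaced candidate values 0, step, 2*step, ..., upper."""
    n = round(upper / step)
    # Rounding avoids floating-point artifacts such as 0.30000000000000004.
    return [round(i * step, 4) for i in range(n + 1)]


def best_delta(candidates: List[float],
               evaluate_map: Callable[[float], float]) -> float:
    """Return the candidate with the highest training MAP (callback is assumed)."""
    return max(candidates, key=evaluate_map)


dp_grid = delta_grid(0.15, 0.01)    # candidates for DP+ and VN-DP+
okapi_grid = delta_grid(1.5, 0.1)   # candidates for Okapi+ and VN-Okapi+
```

Both grids contain 16 candidate values, so the additional tuning cost of the lower bound is the same for the two model families.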

Table XIV shows the MAP performances of DP+ and VN-DP+, as compared to DP
and VN-DP. As shown in Table XIV, DP+ often exhibits non-trivial improvements
over DP, especially for short verbose queries, reaffirming the result reported in
[Lv and Zhai 2011b] that DP+ shows greater effectiveness for short verbose queries
than for keyword queries, although one result differs from that achieved by


According to the notation of [Lv and Zhai 2011b], F(c(w, φ(d)), |φ(d)|, td(w)) corresponds to

\[
\ln\left( \frac{\mu}{s(d) + \mu} + \frac{c(w, \phi(d))}{(s(d) + \mu)\, P(w|C)} \right)
\]

Table XIV. MAP performance comparison of DP, DP+, VN-DP, and VN-DP+ on three collections,
ROBUST, WT10G, and GOV2, and three different query types sk, sv, and lv. EntropyPower is
used for the scope measure in VN-DP and VN-DP+. Symbols α, β, and γ indicate that a run
of the VN method (or a lower-bounded method) shows a statistically significant improvement
over DP, DP+, and VN-DP, respectively, in the t-test at the 0.95 confidence level.
† one or more further significance symbols at this position are illegible in the source.

Query  Method    ROBUST      WT10G       GOV2
sk     DP        0.2447      0.1963      0.2907
sk     DP+       0.2447      0.1957      0.2922
sk     VN-DP     0.2481α†    0.2120α†    0.3099α†
sk     VN-DP+    0.2476α†    0.2112α†    0.3141α†
sv     DP        0.2260      0.1909      0.2455
sv     DP+       0.2337α     0.1969      0.2453
sv     VN-DP     0.2440α†    0.2196α†    0.2826α†
sv     VN-DP+    0.2461α†    0.2215α†    0.2819α†
lv     DP        0.2707      0.2469      0.2864
lv     DP+       0.2766      0.2442      0.2863
lv     VN-DP     0.2799α     0.2614α†    0.3248α†
lv     VN-DP+    0.2858α†    0.2603α†    0.3248α†


[Lv and Zhai 2011b]: our experiment shows a statistically significant improvement
when using DP+ over DP for only sv queries in the ROBUST collection. Unlike DP+,
VN-DP+, the lower-bounded model over VN-DP, does not show greater effectiveness
for short verbose queries than for other types of queries. This may be because VN-DP
already shows a significant improvement over DP for short verbose queries, and
further improvement is therefore less likely. Nevertheless, VN-DP+ continues to further
increase the performance of VN-DP, with improvements being statistically significant
for short keyword queries in GOV2 and long verbose queries in ROBUST.

Importantly, on comparing VN-DP with DP+, we can see that DP+ does not reach
the performance of VN-DP. For almost all runs (except for lv in ROBUST), the
improvements gained by VN-DP over DP are mostly larger than those made by DP+
over DP, and in most cases are statistically significant. Furthermore, VN-DP leads
mostly to statistically significant improvements over DP+ for almost all runs. These
results clearly demonstrate that the improvement from the VN model over DP is not
redundant to the effects from the existing lower-bounding normalization, and leads
to a significant improvement even against lower-bounded models, which are stronger
baselines. Overall, our experimental results indicate that two-stage normalization
significantly improves lower-bounded models for almost all runs for three different
collections.

We now consider the comparison between lower-bounded models for Okapi and
VN-Okapi and their original models. Table XV lists the MAP performances of Okapi+ and
VN-Okapi+, as compared to those of Okapi and VN-Okapi. Again, results similar to
those presented in Table XIV are obtained, although the improvements by the VN models
over lower-bounded models are not as large as in the case of DP; the lower-bounded
models are effective in improving baseline models, without reaching the performance
of VN-Okapi. The improvements gained by VN-Okapi over Okapi are mostly larger
than those made by Okapi+ over Okapi. Although the improvements of VN-Okapi over
Okapi+ are not statistically significant in most cases, VN-Okapi+ leads to statistically
significant improvements over Okapi+ for almost all runs.

For further comparison, we present the performances of original, VN, and
lower-bounded models with respect to standard topic sets of TREC in three test
collections, named TREC6, TREC7, TREC8, ROBUST03, ROBUST04, TREC9, TREC10,

Table XV. MAP performance comparison of Okapi, Okapi+, VN-Okapi, and VN-Okapi+ on three
collections, ROBUST, WT10G, and GOV2, and three different query types sk, sv, and lv.
EntropyPower is used for the scope measure in VN-Okapi and VN-Okapi+. Symbols α, β, and γ
indicate that a run of the VN method (or a lower-bounded method) shows a statistically
significant improvement over Okapi, Okapi+, and VN-Okapi, respectively, in the t-test at the
0.95 confidence level.
† one or more further significance symbols at this position are illegible in the source.

Query  Method       ROBUST      WT10G       GOV2
sk     Okapi        0.2444      0.1946      0.2920
sk     Okapi+       0.2457α     0.2039α     0.2969α
sk     VN-Okapi     0.2477α     0.2071α     0.3004α
sk     VN-Okapi+    0.2477α     0.2085α     0.3100α†
sv     Okapi        0.2247      0.1884      0.2498
sv     Okapi+       0.2279α     0.1900      0.2573α
sv     VN-Okapi     0.2303α     0.1968α     0.2599α
sv     VN-Okapi+    0.2311α†    0.2023α†    0.2658α†
lv     Okapi        0.2619      0.2314      0.3012
lv     Okapi+       0.2640α     0.2320      0.3059α
lv     VN-Okapi     0.2659α     0.2415α†    0.3074α
lv     VN-Okapi+    0.2658α     0.2390α†    0.3094α†


Table XVI. Standard topic sets of TREC, their corresponding collection names, and their
training topic sets.

Topic set id   Query ids    Collection   Training topic sets
TREC6          Q301-Q350    ROBUST       TREC7, TREC8, ROBUST03, ROBUST04
TREC7          Q351-Q400    ROBUST       TREC6, TREC8, ROBUST03, ROBUST04
TREC8          Q401-Q450    ROBUST       TREC6, TREC7, ROBUST03, ROBUST04
ROBUST03       Q601-Q650    ROBUST       TREC6, TREC7, TREC8, ROBUST04
ROBUST04       Q651-Q700    ROBUST       TREC6, TREC7, TREC8, ROBUST03
TREC9          Q451-Q500    WT10G        TREC10
TREC10         Q501-Q550    WT10G        TREC9
TREC2004       Q701-Q750    GOV2         TREC2005, TREC2006
TREC2005       Q751-Q800    GOV2         TREC2004, TREC2006
TREC2006       Q801-Q850    GOV2         TREC2004, TREC2005


TREC2004, TREC2005, and TREC2006. Table XVI presents the basic information on
the standard topic sets of TREC.

First, Table XVII shows the MAP performances of DP+ and VN-DP+, as compared
to DP and VN-DP, on standard TREC topic sets. As shown in Table XVII, VN-DP
or VN-DP+ shows further improvements over DP and DP+ for almost all standard topic
sets, and they are statistically significant for more than half of all cases. In particular,
VN-DP+ shows the best performance for almost all runs. Its improvements over DP
are statistically significant (except for standard topic sets of sk queries in ROBUST
and lv queries in WT10G) and are larger than the improvements of VN-DP or DP+
over DP. Comparing VN-DP to DP+, more runs showed statistically significant
improvements for VN-DP over DP than for DP+ over DP.

Turning to the comparison between lower-bounded and VN models for Okapi, Table
XVIII shows the MAP performances of Okapi+ and VN-Okapi+, as compared to
Okapi and VN-Okapi, on standard TREC topic sets. Again, VN-Okapi+ shows the best
performance for almost all runs, and its improvements over Okapi are larger than
those made by VN-Okapi or Okapi+ over Okapi, being statistically significant for most
cases.

Thus, the main results of Tables XVII and XVIII are largely consistent with those
reported in Tables XIV and XV, respectively. The lower-bounding models do not reach the
performance of VN models; the improvements gained by VN models over the baseline
are mostly larger than those made by lower-bounding models over the baseline; the
further improvements even against lower-bounding models are made by lower-bounding
VN models (VN-DP+ or VN-Okapi+).

Table XVII. MAP performance comparison of DP, DP+, VN-DP, and VN-DP+ on standard topic
sets in TREC and three different query types sk, sv, and lv. EntropyPower is used for the
scope measure in VN-DP and VN-DP+. Symbols α, β, and γ indicate that a run of the VN method
(or a lower-bounded method) shows a statistically significant improvement over DP, DP+,
and VN-DP, respectively, in the t-test at the 0.95 confidence level.
† one or more further significance symbols at this position are illegible in the source.

Query  Topic set   DP            DP+        VN-DP       VN-DP+
sk     TREC6       0.2465        0.2471     0.2483      0.2479
sk     TREC7       0.1733        0.1733     0.1785      0.1783
sk     TREC8       0.2410        0.2415     0.2425      0.2425
sk     ROBUST03    0.2756        0.2755     0.2818α†    0.2812α†
sk     ROBUST04    0.2879        0.2871     0.2903      0.2892
sk     TREC9       0.1985        0.1997     0.2063      0.2068
sk     TREC10      0.1942        0.1917     0.2177α†    0.2156α†
sk     TREC2004    0.2597        0.2605     0.2790α†    0.2842α†
sk     TREC2005    0.3114        0.3130     0.3297α†    0.3336α†
sk     TREC2006    0.3005        0.3026     0.3205α†    0.3228α†
sv     TREC6       0.1751        0.1898     0.1980α     0.2018α†
sv     TREC7       0.1698        0.1768α    0.1887α†    0.1904α†
sv     TREC8       0.2145        0.2194     0.2301      0.2306α†
sv     ROBUST03    0.2912        0.2996     0.3171α†    0.3182α†
sv     ROBUST04    0.2806        0.2840     0.2872      0.2906α
sv     TREC9       [illegible]   0.2094α    0.2299      0.2325α
sv     TREC10      0.1822        0.1844     0.2093α†    0.2105α†
sv     TREC2004    0.2163        0.2154     0.2498α†    0.2475α†
sv     TREC2005    0.2524        0.2524     0.2860α†    0.2860α†
sv     TREC2006    0.2671        0.2676     0.3114α†    0.3114α†
lv     TREC6       0.2627        0.2792α    0.2713α     0.2904α†
lv     TREC7       0.2152        0.2254α    0.2219α     0.2296α†
lv     TREC8       0.2421        0.2511α    0.2539α     0.2619α†
lv     ROBUST03    0.3289        0.3252     0.3391α†    0.3401α†
lv     ROBUST04    0.3051        0.3063     0.3106      0.3074
lv     TREC9       0.2550        0.2550     0.2675      0.2675
lv     TREC10      0.2388        0.2334     0.2553α†    0.2530
lv     TREC2004    0.2650        0.2651     0.2943α†    0.2943α†
lv     TREC2005    0.2875        0.2870     0.3232α†    0.3232α†
lv     TREC2006    0.3062        0.3062     0.3563α†    0.3563α†


As a consequence, the overall results shown in Tables XIV, XV, XVII, and XVIII
consistently indicate that the improvement resulting from the application of two-stage
normalization is not fully replaceable by adopting lower-bounding term frequency
normalization, and vice versa. This result is intuitive, as the two normalizations aim at
different deficiencies of the existing normalization method: lower-bounding term frequency
normalization aims at avoiding the penalization of very long documents, while two-

However, compared to the previous experiments, the statistical significance for VN models is weakly
supported on some sets of standard queries for both the DP and Okapi cases. The same weak support for
statistical significance is also observed in lower-bounding models, where the improvements are not
statistically significant on some of the standard TREC topic sets. We believe that the reason is the lack
of evidence by which to judge significance. The number of queries used in standard topic sets is 50, which
is often not sufficient to convincingly decide significance, especially when the improvement is marginal.
Consider, for example, the case of Okapi and sk queries in ROBUST (shown in Table XVIII). When using only
50 queries in standard TREC, the improvements made by VN-Okapi and Okapi+ over the baseline are mostly
not of statistical significance. However, when using 250 queries in ROBUST, the improvements turn out to
be statistically significant, as shown in Table XV. Thus, the results show that a larger number of queries
might be necessary for the statistical significance test when the improvements are marginal.
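
The dependence of significance on the number of queries can be illustrated with a paired t statistic over per-query scores. This is a hedged sketch: the text only says "t-test", so the paired form is an assumption, and the function name and the toy numbers in the usage note are illustrative rather than taken from the experiments.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], improved: Sequence[float]) -> float:
    """t statistic of the per-query differences (improved minus baseline).

    The statistic is mean(diff) / (stdev(diff) / sqrt(n)); it is undefined
    when all differences are identical (zero standard deviation).
    """
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

For a fixed mean difference and spread, the statistic grows roughly with the square root of the number of queries, so the same marginal gain is easier to detect with 250 queries than with 50.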

Table XVIII. MAP performance comparison of Okapi, Okapi+, VN-Okapi, and VN-Okapi+ on
standard topic sets in TREC and three different query types sk, sv, and lv. EntropyPower is
used for the scope measure in VN-Okapi and VN-Okapi+. Symbols α, β, and γ indicate that a
run of the VN method (or a lower-bounded method) shows a statistically significant
improvement over Okapi, Okapi+, and VN-Okapi, respectively, in the t-test at the 0.95
confidence level.
† one or more further significance symbols at this position are illegible in the source.

Query  Topic set   Okapi     Okapi+     VN-Okapi      VN-Okapi+
sk     TREC6       0.2471    0.2479     0.2492        0.2498
sk     TREC7       0.1734    0.1754     0.1758        0.1773α
sk     TREC8       0.2413    0.2415     0.2445        0.2416
sk     ROBUST03    0.2760    0.2786     0.2826α       0.2833α
sk     ROBUST04    0.2848    0.2861     0.2872        0.2874
sk     TREC9       0.1946    0.2077α    0.2050        0.2089
sk     TREC10      0.1946    0.2001     0.2092α       0.2081
sk     TREC2004    0.2562    0.2606     0.2659        0.2766α†
sk     TREC2005    0.3222    0.3297α    [illegible]   0.3354α†
sk     TREC2006    0.2969    0.2996     0.3113α       0.3174α†
sv     TREC6       0.1628    0.1663     0.1678        0.1678
sv     TREC7       0.1603    0.1642α    0.1653α       0.1702α†
sv     TREC8       0.2111    0.2144     0.2154        0.2172α
sv     ROBUST03    0.3148    0.3145     0.3196α       0.3159
sv     ROBUST04    0.2753    0.2809α    0.2847        0.2847α
sv     TREC9       0.1958    0.1970     0.2055        0.2117α†
sv     TREC10      0.1809    0.1831     0.1880        0.1930α
sv     TREC2004    0.2240    0.2302     0.2327α       0.2401α†
sv     TREC2005    0.2540    0.2682α    0.2597        0.2714α†
sv     TREC2006    0.2707    0.2731     0.2866α       0.2855α†
lv     TREC6       0.2365    0.2360     0.2366        0.2374
lv     TREC7       0.2077    0.2104α    0.2093        0.2093
lv     TREC8       0.2420    0.2434     0.2448        0.2451
lv     ROBUST03    0.3287    0.3331     0.3378        0.3381α
lv     ROBUST04    0.2954    0.2980     0.3016α       0.3000α
lv     TREC9       0.2248    0.2281α    0.2413α†      0.2362α†
lv     TREC10      0.2379    0.2359     0.2418        0.2418
lv     TREC2004    0.2695    0.2710     0.2743        0.2746
lv     TREC2005    0.3056    0.3146α    0.3083        0.3164α
lv     TREC2006    0.3280    0.3315     0.3389α       0.3365α†


stage normalization aims at avoiding insufficient penalization of verbose documents
and excessive penalization of long documents. The resultant solutions for these
different goals also differ: lower-bounding term frequency normalization does not need
to decompose the document length into additional factors, but rather enforces a
sufficient score gap between documents in which a query term appears and documents
in which it does not. Two-stage normalization decomposes a document length into
verbosity and scope factors, penalizing verbose and broad documents separately.
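
The decomposition of document length into verbosity and scope can be sketched as follows. The sketch rests on two assumptions stated plainly: the scope is measured here by the number of distinct terms (used as a stand-in for the UniqLength measure), and the verbosity is v(d) = |d| / s(d), the relation implied by Eq. (19). The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def pre_normalize(tokens: List[str]) -> Tuple[Dict[str, float], int, float]:
    """Split document length into scope and verbosity, then pre-normalize.

    Returns the verbosity-normalized term frequencies c(w, d) / v(d),
    the scope s(d) (number of distinct terms, an assumed measure),
    and the verbosity v(d) = |d| / s(d).
    """
    counts = Counter(tokens)
    scope = len(counts)
    verbosity = len(tokens) / scope
    normalized = {w: c / verbosity for w, c in counts.items()}
    return normalized, scope, verbosity


normalized_tf, scope, verbosity = pre_normalize(["a", "a", "a", "a", "b", "b"])
```

In this toy document the length 6 factors into scope 2 and verbosity 3, and the normalized frequencies sum to the scope, so a subsequent scope normalization sees a document of length s(d).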

In conclusion, given the set of results described throughout this section, we can
state the following. The proposed two-stage normalization is clearly effective for
further improving the existing retrieval model; its applicability is not limited to
improving the baseline retrieval model, and it can also be extended to improved methods
that use the term dependency and the lower-bounding term frequency normalization.
From the comparative axiomatic analysis results, we conclude that the normalization
heuristics H1 and H2 should necessarily be applied for scoring a document.


8. CONCLUSION 

In this paper, we argue that a normalization function should use different penalizations
for verbosity and scope, and we propose the use of two-stage normalization. Our
main contributions over and above those of previous works that formulated ranking
functions belonging to two-stage normalization [Singhal et al. 1996; Na et al. 2008a]
are as follows: 1) We generalize two-stage normalization such that it can be applied
to any retrieval model. 2) We perform comparative axiomatic analysis and capture the
exact retrieval heuristics resulting from two-stage normalization and its difference
from the original method. The results of experiments on three models (DP, Okapi,
and MRF) consistently show that two-stage normalization is promising.

Of course, considerable work remains. Although two-stage normalization is effective
in the case of DP, Okapi, and MRF, we still need more evidence of its effectiveness
for other retrieval models. Thus, an obvious direction for future work is to explore
the application of two-stage normalization to the pivoted vector space model
[Singhal et al. 1996], DFR [Amati and Van Rijsbergen 2002], a more recently developed
information model [Clinchant and Gaussier 2010], a parameterized query expansion
[Bendersky et al. 2011], a term-specific adaptation of the normalization parameter
[Lv and Zhai 2011a], or a learning-to-rank framework [Liu 2009; Li 2011]. Another
research direction is strengthening the axiomatic framework by generalizing the current
retrieval constraints such that they can effectively cover the conjectured retrieval
heuristics derived in this paper. A more challenging future research direction is to
develop a new retrieval model in an innovative manner such that it includes the
verbosity, scope, and document length as retrieval parameters.

Appendix A: Definition of Verbosity

For a full definition of verbosity, suppose that $M$ is the total number of topics in a
collection, and $s(d)$ is the number of topics mentioned in $d$ (or the expected number of
topics in $d$). Here, we assume that topics are countable; a topic may refer to an individual
word or a concept. Given document $d$, we first define the topic-specific verbosity of
document $d$, denoted $v(t, d)$, as the sum of the frequencies of all words that belong to $t$:

\[
v(t, d) = \sum_{w} c(w, d)\, P(t \mid w) \tag{20}
\]

where $P(t \mid w)$ is the posterior probability that $w$ comes from $t$ (i.e.,
$\sum_{i} P(t_i \mid w) = 1$).

Under Eq. (20), we can readily show that $v(t, d)$ is the length of the passages in $d$ that
belong to $t$.

By the definition above, we further show that the following equality holds:

\[
|d| = \sum_{i=1}^{M} v(t_i, d) = v(t_1, d) + \cdots + v(t_M, d) \tag{21}
\]

To simplify Eq. (21), note that $v(t, d) = 0$ for most topics, as documents usually
cover only a few topics. Let $i_1, \ldots, i_k, \ldots, i_{s(d)}$ be the indexes of the topics
appearing in $d$, i.e., those with $v(t_{i_k}, d) > 0$. Then, $|d|$ is reformulated as

\[
|d| = v(t_{i_1}, d) + \cdots + v(t_{i_{s(d)}}, d) \tag{22}
\]

Now, $v(d)$, the verbosity of document $d$, is defined as the average of the per-topic
verbosities over all $s(d)$ topics appearing in $d$, which is given by:

\[
v(d) = \frac{v(t_{i_1}, d) + \cdots + v(t_{i_{s(d)}}, d)}{s(d)} = \frac{|d|}{s(d)} \tag{23}
\]

Thus, the average verbosity is the document length divided by the number of topics
$s(d)$, which exactly replicates the definition of verbosity used in the main text. Given
Eq. (23), we only require $s(d)$, without needing to estimate the values $v(t, d)$, which are
usually unobserved and hard to compute.
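
As a small illustration of Eqs. (20) to (23), the following Python sketch computes topic-specific verbosities from a toy posterior table and checks that they sum to the document length, so that the average verbosity reduces to $|d|/s(d)$. The topic names, the posterior values, and the helper names are hypothetical and chosen only for illustration; they are not part of the proposed method.

```python
from collections import Counter

def topic_verbosities(doc_tokens, posterior):
    """Eq. (20): v(t, d) = sum over words w of c(w, d) * P(t | w)."""
    counts = Counter(doc_tokens)
    v = {}
    for w, c in counts.items():
        for t, p in posterior[w].items():
            v[t] = v.get(t, 0.0) + c * p
    return v

def verbosity(doc_tokens, posterior):
    """Eq. (23): average per-topic verbosity over the topics present in d."""
    v = topic_verbosities(doc_tokens, posterior)
    active = {t: x for t, x in v.items() if x > 0}
    scope = len(active)                      # s(d): number of topics in d
    return sum(active.values()) / scope, scope

# Hypothetical posterior P(t | w); each row sums to 1.
posterior = {
    "cat": {"pets": 1.0},
    "dog": {"pets": 1.0},
    "stock": {"finance": 1.0},
    "bank": {"finance": 0.5, "geo": 0.5},
}
doc = ["cat", "dog", "cat", "stock", "bank", "bank"]

per_topic = topic_verbosities(doc, posterior)
v_d, s_d = verbosity(doc, posterior)
```

In this toy case the per-topic verbosities add up to the six tokens of the document, so `v_d` coincides with `len(doc) / s_d` without any per-topic estimate being needed.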


Appendix B: On Document-Specific Conjugate Prior for VN-DP 

The use of a document-specific conjugate prior for VN-DP (i.e., in the VN-DP formula of
the main text) is derived from the verbosity hypothesis. For convenience of discussion,
suppose that $c'(w, d)$ indicates the unseen frequency of $w$ in $d$. Generally, smoothing
uses $c_s(w, d)$, defined as $c(w, d) + c'(w, d)$, as the count of $w$ in $d$, thereby
estimating $P(w \mid d)$ as $c_s(w, d) / \sum_{w \in V} c_s(w, d)$.

In DP, it is assumed that the pseudo length $\mu$ is distributed over unseen words
according to $P(w \mid C)$, resulting in $c'(w, d) = \mu P(w \mid C)$. In our case, however,
because document length is decomposed into verbosity and scope, we need to introduce two
pseudo factors for unseen words: verbosity and scope. Formally, let $v'(d)$ and $s'(d)$ be
the verbosity and scope of the unseen part of $d$, respectively. Just as the frequencies of
seen words are given by $c(w, d) = v(d)\, s(d)\, P_{ml}(w \mid d)$, the frequencies of unseen
words are formulated as $c'(w, d) = v'(d)\, s'(d)\, P(w \mid C)$.

To determine $v'(d)$ and $s'(d)$, we use the following assumptions:

1. Verbosity of an unseen part: given a document, the verbosity of the unseen passages
(i.e., those consisting of all unseen words) is the same as the verbosity of the document.
This assumption follows from the verbosity hypothesis, under which $c(w, d)$ is mostly
governed by $v(d)$. Just as the verbosity hypothesis is applied to seen words, we apply it
to unseen words, so that $c'(w, d)$, the frequencies of unseen words, should also be
governed by $v(d)$.
2. Scope of an unseen part: given a document, the scope of the unseen passages (i.e.,
those consisting of all unseen words) is independent of the scope of the document. Unlike
verbosity, we do not make a document-specific setting for the scope, as the relation
between the unseen scope of $d$ and $v(d)$ is not very clear.

Under these assumptions, we have $v'(d) = v(d)$ and $s'(d) = \mu$, thus resulting in
$c'(w, d) = \mu\, v(d)\, P(w \mid C)$, which leads to our use of a document-specific prior
for VN-DP.

Therefore, the difference in formulating $c'(w, d)$ between DP and VN-DP results from
whether or not we use the verbosity hypothesis to determine the frequencies of unseen
words.
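
The contrast between the two priors can be sketched in a few lines of Python. The sketch below assumes the smoothed estimate $c_s(w,d) / \sum_{w} c_s(w,d)$ described above, with $c'(w,d) = \mu P(w \mid C)$ for DP and $c'(w,d) = \mu\, v(d)\, P(w \mid C)$ for VN-DP; the function names and the toy numbers are hypothetical, and the scope is simply passed in as a given value.

```python
import math

def p_dp(c_wd, doc_len, p_wc, mu):
    """DP estimate: unseen mass c'(w, d) = mu * P(w|C)."""
    return (c_wd + mu * p_wc) / (doc_len + mu)

def p_vn_dp(c_wd, doc_len, scope, p_wc, mu):
    """VN-DP estimate: unseen mass c'(w, d) = mu * v(d) * P(w|C), v(d) = |d| / s(d)."""
    v = doc_len / scope
    return (c_wd + mu * v * p_wc) / (doc_len + mu * v)

# Hypothetical document, and a "verbose" copy with every frequency doubled
# (same scope, twice the verbosity).
p_wc, mu = 0.01, 10.0
dp_base = p_dp(6, 120, p_wc, mu)
dp_verbose = p_dp(12, 240, p_wc, mu)
vn_base = p_vn_dp(6, 120, 40, p_wc, mu)
vn_verbose = p_vn_dp(12, 240, 40, p_wc, mu)

# Equivalent scope-level form: (s(d) * P_ml(w|d) + mu * P(w|C)) / (s(d) + mu)
scope_form = (40 * (6 / 120) + mu * p_wc) / (40 + mu)
```

Under these assumptions, doubling every term frequency leaves the VN-DP estimate unchanged, whereas the plain DP estimate shifts, which is the behavior the verbosity hypothesis asks for.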


Appendix C: Comparative Axiomatic Analysis

In this appendix, we briefly summarize the derivations of $C_1$, $C_2$, and $C_3$ for Okapi
and DP, where $C_1$ and $C_3$ are necessary but not sufficient for satisfying the particular
constraint. Let $d_1$ and $d_2$ be the two given documents for the LNCs and TF-LNC, and let
$\Delta f(d_1, d_2, q)$ be $f(d_1, q) - f(d_2, q)$. All our derivations start from the
inequality $\Delta f(d_1, d_2, q) > 0$ (or $\Delta f(d_1, d_2, q) \ge 0$). For VN-DP,
$\Delta f(d_1, d_2, q) \ge 0$ is equivalent to

\[
\frac{s(d_1)\, p_{ml}(w \mid d_1) + \mu\, p(w \mid C)}{s(d_1) + \mu}
\ge
\frac{s(d_2)\, p_{ml}(w \mid d_2) + \mu\, p(w \mid C)}{s(d_2) + \mu} \tag{24}
\]

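For VN-Okapi, $\Delta f(d_1, d_2, q) \ge 0$ is equivalent to

\[
\frac{c(w, d_1)}{v(d_1)}\, k_1 \left(1 - b + b\,\frac{s(d_2)}{\mathit{avgs}}\right) \mathit{idf}(w)
\ge
\frac{c(w, d_2)}{v(d_2)}\, k_1 \left(1 - b + b\,\frac{s(d_1)}{\mathit{avgs}}\right) \mathit{idf}(w) \tag{25}
\]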


1) LNC1

We first show the derivation of the conditions for LNC1 under VN-DP and VN-Okapi.
For the sake of convenience, we introduce the variables $m$ and $e$, defined as
$m = v(d_1)/v(d_2)$ and $e = K/|d_1|$, respectively. According to the definition of PAN,
we have the following relations between $s(d_1)$ and $s(d_2)$:

\[
s(d_2) = m(1 + e)\, s(d_1), \qquad
s(d_2)\, p_{ml}(w \mid d_2) = m \cdot s(d_1)\, p_{ml}(w \mid d_1) \tag{26}
\]

Below, we summarize the derivation for each case of VN-DP and VN-Okapi.


i) VN-DP. We first simplify Eq. (24) by replacing $s(d_2)$ with the terms $m$, $e$, and
$s(d_1)$ using Eq. (26), as follows:

\[
\mu \left(p_{ml}(w \mid d_1) - p(w \mid C)\right)
\ge
m \left( \mu \left(p_{ml}(w \mid d_1) - p(w \mid C)\right) - \mu e\, p(w \mid C) - e\, s(d_1)\, p_{ml}(w \mid d_1) \right) \tag{27}
\]

First, when $p_{ml}(w \mid d_1) = p(w \mid C)$, it is easily shown that Eq. (27) holds. Thus,
we do not consider the equality case of $A_1$ when simplifying Eq. (27).

Under the remaining cases of $A_1$ (i.e., $p_{ml}(w \mid d_1) > p(w \mid C)$), Eq. (27) is
equivalent to

\[
\frac{v(d_2)}{v(d_1)} \ge 1 - \frac{K}{|d_1|} \cdot
\frac{p(w \mid C) + p_{ml}(w \mid d_1)\, s(d_1)\, \mu^{-1}}{p_{ml}(w \mid d_1) - p(w \mid C)} \tag{28}
\]

Under $p_{ml}(w \mid d_1) > p(w \mid C)$, the right-hand side of Eq. (28) is a decreasing
function with respect to $p(w \mid C)$, with the upper bound when $p(w \mid C) = 0$. After
replacing the right-hand side with this upper bound, the necessary condition for Eq. (28) is
simplified to:

\[
\frac{v(d_2)}{v(d_1)} \ge 1 - \frac{K}{|d_1|} \cdot \frac{s(d_1)}{\mu}
= 1 - \frac{K}{v(d_1)} \cdot \frac{1}{\mu} \tag{29}
\]

which is satisfied if $C_1$ is true, regardless of the choice of parameter $\mu$.
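
The equivalence between the direct comparison in Eq. (24) and the threshold form in Eq. (28) can be checked numerically. The Python sketch below reads Eq. (26) as saying that $d_2$ has the same query-term count as $d_1$ and is $K$ tokens longer; that reading, the function names, and the toy parameter values are assumptions made only for this illustration.

```python
def vn_dp_score(scope, p_ml, p_c, mu):
    """Per-term VN-DP quantity compared on each side of Eq. (24)."""
    return (scope * p_ml + mu * p_c) / (scope + mu)

def lnc1_direct(len1, c, K, v1, v2, p_c, mu):
    """Evaluate Eq. (24) directly: d2 is d1 plus K non-query occurrences."""
    s1, s2 = len1 / v1, (len1 + K) / v2
    left = vn_dp_score(s1, c / len1, p_c, mu)
    right = vn_dp_score(s2, c / (len1 + K), p_c, mu)
    return left >= right

def lnc1_bound(len1, c, K, v1, p_c, mu):
    """Right-hand side of Eq. (28), a lower threshold on v(d2) / v(d1)."""
    p1, s1 = c / len1, len1 / v1
    return 1 - (K / len1) * (p_c + p1 * s1 / mu) / (p1 - p_c)

# Hypothetical setting with p_ml(w|d1) = 0.05 > p(w|C) = 0.01.
args = dict(len1=100, c=5, K=10, v1=2.0, p_c=0.01, mu=20.0)
bound = lnc1_bound(**args)
results = {
    v2: (lnc1_direct(v2=v2, **args), v2 / args["v1"] >= bound)
    for v2 in (1.0, 1.2, 1.4, 2.0, 3.0)
}
```

For each tested $v(d_2)$, the two entries of `results[v2]` agree: the score of $d_1$ is at least that of $d_2$ exactly when the verbosity ratio clears the threshold of Eq. (28).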
ii) VN-Okapi. As in the case of VN-DP, we replace $s(d_2)$ with the terms $m$, $e$, and
$s(d_1)$ based on Eq. (26), simplifying Eq. (25) to

\[
(1 - b)\, \mathit{idf}(w) \ge m \left( (1 - b) - b e\, \frac{s(d_1)}{\mathit{avgs}} \right) \mathit{idf}(w) \tag{30}
\]

First, when $\mathit{idf}(w) = 0$, it is clear that Eq. (30) holds.

Second, when $\mathit{idf}(w) > 0$ (under $A_1$), Eq. (30) is further rewritten as

\[
\frac{1}{m} \ge 1 - \frac{b \cdot e \cdot s(d_1)}{(1 - b)\, \mathit{avgs}} \tag{31}
\]

which is equivalent to

\[
\frac{v(d_2)}{v(d_1)} \ge 1 - \frac{b \cdot K}{(1 - b)\, v(d_1) \cdot \mathit{avgs}} \tag{32}
\]

Thus, LNC1 is satisfied if $C_1$ is true.


2) LNC2

We can straightforwardly derive that LNC2 is equivalent to $C_2$. To simplify the
notation for the derivation, we introduce $p = p_{ml}(w \mid d_1) = p_{ml}(w \mid d_2)$; the
equality holds because of the characteristic of PLS. We summarize the derivation of $C_2$
for each of VN-DP and VN-Okapi.

i) VN-DP. For VN-DP, Eq. (24) is simplified to

\[
\frac{s(d_1)\, p + \mu\, p(w \mid C)}{s(d_1) + \mu}
\ge
\frac{s(d_2)\, p + \mu\, p(w \mid C)}{s(d_2) + \mu} \tag{33}
\]

It is trivial to show that the necessary and sufficient condition for LNC2 is $C_2$, if
$A_1$ holds.

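ii) VN-Okapi. For VN-Okapi, Eq. (25) is simplified to

\[
p\, s(d_1) \left( k_1 \left(1 - b + b\,\frac{s(d_2)}{\mathit{avgs}}\right) \right) \mathit{idf}(w)
\ge
p\, s(d_2) \left( k_1 \left(1 - b + b\,\frac{s(d_1)}{\mathit{avgs}}\right) \right) \mathit{idf}(w) \tag{34}
\]

Using $\mathit{idf}(w) > 0$ from $A_1$, Eq. (34) is equivalent to $C_2$.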


3) TF-LNC

For the sake of convenience, we introduce the variables $m'$ and $e'$ by putting
$m' = v(d_2)/v(d_1)$ and $e' = K/|d_2|$. According to the definition of PAR,

\[
s(d_1) = m'(1 + e')\, s(d_2), \qquad
s(d_1)\, p_{ml}(w \mid d_1) = m' \left( p_{ml}(w \mid d_2) + e' \right) s(d_2) \tag{35}
\]

Below, we summarize the derivation for each of VN-DP and VN-Okapi.
i) VN-DP. We first simplify Eq. (24) by replacing $s(d_1)$ with the terms $m'$, $e'$, and
$s(d_2)$, as follows:

\[
m' \left( \frac{e' \left(1 - p_{ml}(w \mid d_2)\right) s(d_2)}{\mu}
+ \left(p_{ml}(w \mid d_2) - p(w \mid C)\right) + e' \left(1 - p(w \mid C)\right) \right)
\ge \left(p_{ml}(w \mid d_2) - p(w \mid C)\right) \tag{36}
\]

When $p_{ml}(w \mid d_2) = p(w \mid C)$, it is easily shown that Eq. (36) holds.

When $p_{ml}(w \mid d_2) > p(w \mid C)$, in the remaining cases of $A_1$, Eq. (36) is
equivalent to

\[
\frac{v(d_1)}{v(d_2)} \le 1 + \frac{K}{|d_2|} \cdot
\frac{\left(1 - p(w \mid C)\right) + \left(1 - p_{ml}(w \mid d_2)\right) s(d_2)\, \mu^{-1}}{p_{ml}(w \mid d_2) - p(w \mid C)} \tag{37}
\]

For $p_{ml}(w \mid d_2) > p(w \mid C)$, the right-hand side of Eq. (37) is an increasing
function with respect to $p(w \mid C)$, with the lower bound when $p(w \mid C) = 0$. In
addition, we can further lower the bound by eliminating
$(1 - p_{ml}(w \mid d_2))\, s(d_2)/\mu$ because it is a positive value. Thus, we obtain the
following necessary condition for Eq. (37):

\[
\frac{v(d_1)}{v(d_2)} \le 1 + \frac{K}{c(w, d_2)} \tag{38}
\]

which is equivalent to $C_3$.

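ii) VN-Okapi. As in the case of VN-DP, we replace $s(d_1)$ with the terms $m'$, $e'$, and
$s(d_2)$ using Eq. (35), simplifying Eq. (25) to

\[
m' \left( e' \left(1 - p_{ml}(w \mid d_2)\right) \frac{b\, s(d_2)}{\mathit{avgs}}
+ \left(p_{ml}(w \mid d_2) + e'\right)(1 - b) \right) \mathit{idf}(w)
\ge p_{ml}(w \mid d_2)\,(1 - b)\, \mathit{idf}(w) \tag{39}
\]

Under $A_1$, because $\mathit{idf}(w) > 0$, Eq. (39) is equivalent to

\[
\frac{v(d_1)}{v(d_2)} \le 1 + \frac{K}{|d_2|} \cdot
\frac{(1 - b) + \left(1 - p_{ml}(w \mid d_2)\right) b\, \frac{s(d_2)}{\mathit{avgs}}}{p_{ml}(w \mid d_2)\,(1 - b)} \tag{40}
\]

A lower bound for the right-hand side of Eq. (40) is obtained by eliminating
$(1 - p_{ml}(w \mid d_2))\, b\, s(d_2)/\mathit{avgs}$, which is a positive value. Therefore,
the necessary condition for Eq. (40) becomes

\[
\frac{v(d_1)}{v(d_2)} \le 1 + \frac{K}{|d_2|} \cdot \frac{1}{p_{ml}(w \mid d_2)}
= 1 + \frac{K}{c(w, d_2)} \tag{41}
\]

which is equivalent to $C_3$.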
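
As a numerical cross-check of the VN-Okapi case, the following Python sketch compares a direct evaluation of the score ordering with the threshold in Eq. (40). It assumes the common Okapi term-frequency form $x / (k_1(1 - b + b\, s(d)/\mathit{avgs}) + x)$ applied to the pre-normalized frequency $x = c(w,d)/v(d)$, and it reads Eq. (35) as saying that $d_1$ is $d_2$ with $K$ extra occurrences of the query term; both readings, the function names, and the toy values are assumptions for illustration only.

```python
def vn_okapi_tf(c, doc_len, v, k1, b, avgs):
    """Okapi-style TF component on verbosity-normalized counts (assumed form)."""
    x = c / v              # pre-normalized term frequency c(w, d) / v(d)
    s = doc_len / v        # scope s(d) = |d| / v(d)
    return x / (k1 * (1 - b + b * s / avgs) + x)

def tf_lnc_direct(len2, c2, K, v1, v2, k1, b, avgs):
    """Does d1 (d2 plus K query-term occurrences) score at least as high as d2?"""
    tf1 = vn_okapi_tf(c2 + K, len2 + K, v1, k1, b, avgs)
    tf2 = vn_okapi_tf(c2, len2, v2, k1, b, avgs)
    return tf1 >= tf2

def eq40_bound(len2, c2, K, v2, b, avgs):
    """Right-hand side of Eq. (40), an upper threshold on v(d1) / v(d2)."""
    p2, s2 = c2 / len2, len2 / v2
    return 1 + (K / len2) * ((1 - b) + (1 - p2) * b * s2 / avgs) / (p2 * (1 - b))

cfg = dict(len2=100, c2=5, K=10, v2=2.0, b=0.5, avgs=50.0)
k1 = 1.2
bound = eq40_bound(**cfg)
results = {
    v1: (tf_lnc_direct(v1=v1, k1=k1, **cfg), v1 / cfg["v2"] <= bound)
    for v1 in (2.0, 9.0, 10.0, 12.0)
}
```

In this toy setting the ordering flips between $v(d_1) = 9$ and $v(d_1) = 10$, which is where the ratio $v(d_1)/v(d_2)$ crosses the threshold returned by `eq40_bound`.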


Appendix D: Analysis Result of TF-LNC under UniqLength and LengthPower 

TF-LNC is true when using UniqLength and LengthPower. To prove this, let $\Delta s$ be
$s(d_1) - s(d_2)$. First, when $\Delta s > K/v(d_2)$, it is equivalent to $v(d_1) < v(d_2)$,
and thus, $C_3$ is satisfied. Otherwise (i.e., $\Delta s \le K/v(d_2)$), $C_3$ is equivalent to

\[
c(w, d_2) \le K\, \frac{v(d_2)}{v(d_1) - v(d_2)}
= K\, \frac{|d_2| + \Delta s \cdot v(d_2)}{K - \Delta s \cdot v(d_2)} \tag{42}
\]


19. Note that Eq. (38) can contain the equality condition, because $p(w \mid C) > 0$ (even when $p(w \mid C) \to 0$).
20. Note that Eq. (41) can allow the equality condition.
The term on the right-hand side of Eq. (42) is an increasing function with respect to
$\Delta s$ on $0 \le \Delta s < K/v(d_2)$, because its numerator grows and its denominator
shrinks as $\Delta s$ increases. For the cases of UniqLength and LengthPower,
$\Delta s \ge 0$, regardless of $K$; therefore, Eq. (42) is satisfied if
$c(w, d_2) \le |d_2|$ (i.e., the bound obtained by using $\Delta s = 0$ in the right-hand
side of Eq. (42)), which is true for all cases.
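
The role of the $\Delta s = 0$ case can be seen by evaluating the right-hand side of Eq. (42) for a few admissible values of $\Delta s$. The Python sketch below uses hypothetical values for $K$, $|d_2|$, and $v(d_2)$ and a helper name introduced only for this illustration.

```python
def rhs_eq42(K, len2, v2, delta_s):
    """Right-hand side of Eq. (42); meaningful for 0 <= delta_s < K / v2."""
    return K * (len2 + delta_s * v2) / (K - delta_s * v2)

# Hypothetical values: K / v2 = 5, so delta_s is kept below 5.
K, len2, v2 = 10, 100, 2.0
values = [rhs_eq42(K, len2, v2, ds) for ds in (0.0, 1.0, 2.0, 4.0)]
```

The first entry equals $|d_2|$ and the later entries are no smaller, so any $c(w, d_2) \le |d_2|$ satisfies Eq. (42) across this range.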


Appendix E: Analysis Result of TF-LNC under EntropyPower 

In this appendix, for VN models using EntropyPower, we show that TF-LNC is satisfied
if $C_4$ is true. To prove this, let $\Delta s$ be $s(d_1) - s(d_2)$. When
$\Delta s > K/v(d_2)$, it is equivalent to $v(d_1) < v(d_2)$, and thus, $C_3$ is satisfied.
Otherwise, $C_3$ is equivalent to:

\[
\Delta s \ge \frac{K \left(c(w, d_2) - |d_2|\right)}{\left(c(w, d_2) + K\right) v(d_2)} \tag{43}
\]

Eq. (43) is further simplified to:

\[
\frac{s(d_1)}{s(d_2)} \ge \frac{c(w, d_2)}{c(w, d_2) + K} \cdot \frac{|d_2| + K}{|d_2|} \tag{44}
\]

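Using the definition of EntropyPower, we can rewrite $\log s(d_1)$ and $\log s(d_2)$ as:

\[
\log s(d_2) = -\sum_{w' \in d_2} \frac{c(w', d_2)}{|d_2|} \log c(w', d_2) + \log |d_2| \tag{45}
\]

\[
\log s(d_1) = -\frac{c(w, d_2) + K}{|d_2| + K} \log\left(c(w, d_2) + K\right)
- \sum_{w' \in d_2,\, w' \ne w} \frac{c(w', d_2)}{|d_2| + K} \log c(w', d_2)
+ \log\left(|d_2| + K\right) \tag{46}
\]

Substituting Eqs. (46) and (45) in Eq. (44), we now obtain the following condition for
TF-LNC:

\[
\begin{aligned}
\log s(d_1) - \log s(d_2)
= {} & -\frac{c(w, d_2) + K}{|d_2| + K} \log\left(c(w, d_2) + K\right)
+ \frac{c(w, d_2)}{|d_2| + K} \log c(w, d_2) \\
& + \frac{K}{|d_2| + K} \left(\log |d_2| - \log s(d_2)\right)
+ \log\left(|d_2| + K\right) - \log |d_2| \\
\ge {} & \log \frac{c(w, d_2)}{c(w, d_2) + K} + \log \frac{|d_2| + K}{|d_2|}
\end{aligned} \tag{47}
\]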


From the definition of TF-LNC, it is clear that once Eq. (47) holds for $K = 1$, then
Eq. (47) also holds for every $K$. Therefore, here, we only consider $K = 1$. When $K = 1$,
Eq. (47) is further simplified to:

\[
\frac{|d_2| - c(w, d_2)}{|d_2| + 1} \log\left(c(w, d_2) + 1\right)
- \frac{|d_2| + 1 - c(w, d_2)}{|d_2| + 1} \log c(w, d_2)
\ge \frac{1}{|d_2| + 1} \log \frac{s(d_2)}{|d_2|} \tag{48}
\]

which leads to:

\[
\left(|d_2| - c(w, d_2)\right) \log \frac{c(w, d_2) + 1}{c(w, d_2)} - \log c(w, d_2)
\ge \log \frac{s(d_2)}{|d_2|} \tag{49}
\]


Applying $(1 + x)^n \ge 1 + nx$ to the first term of the left-hand side of Eq. (49), we
obtain the following sufficient condition for Eq. (49):

\[
\log \frac{|d_2|}{c(w, d_2)} - \log c(w, d_2) \ge \log \frac{s(d_2)}{|d_2|} \tag{50}
\]

which is rewritten as

\[
2 \left(\log |d_2| - \log c(w, d_2)\right) \ge \log s(d_2) \tag{51}
\]

Therefore, we finally obtain the condition:

\[
s(d_2) \le \left(\frac{|d_2|}{c(w, d_2)}\right)^{2} \tag{52}
\]

which is equivalent to $C_4$.
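
The chain from Eq. (44) to Eq. (52) can be spot-checked on a toy document. The Python sketch below assumes that EntropyPower is the exponential of the entropy of the document's term distribution, which is the form implied by Eq. (45); the term counts and helper names are hypothetical, and a single example illustrates the sufficient condition rather than proving it.

```python
import math

def entropy_power(counts):
    """Assumed EntropyPower scope: s(d) = exp(H(d)), H the term-distribution entropy."""
    n = sum(counts.values())
    return math.exp(-sum(c / n * math.log(c / n) for c in counts.values()))

# Hypothetical d2; d1 adds one occurrence (K = 1) of the query term w.
d2 = {"a": 3, "b": 2, "c": 1, "d": 1, "e": 1}
w = "a"
d1 = dict(d2)
d1[w] += 1

len2, c2 = sum(d2.values()), d2[w]
s1, s2 = entropy_power(d1), entropy_power(d2)

# Eq. (45): log s(d2) rewritten in terms of raw counts.
eq45 = -sum(c / len2 * math.log(c) for c in d2.values()) + math.log(len2)

cond52 = s2 <= (len2 / c2) ** 2                                # Eq. (52)
cond44 = s1 / s2 >= c2 * (len2 + 1) / ((c2 + 1) * len2)        # Eq. (44) with K = 1
```

Here `cond52` holds and, as the derivation predicts, `cond44` holds as well.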